We propose new bounds on the error of learning algorithms in 
terms of a data-dependent notion of complexity. The estimates we 
establish give optimal rates and are based on a local and empirical 
version of Rademacher averages, in the sense that the Rademacher 
averages are computed from the data, on a subset of functions with 
small empirical error. We present some applications to classification 
and prediction with convex function classes, and with kernel classes 
in particular. 

1. Introduction. Estimating the performance of statistical procedures 
is useful for providing a better understanding of the factors that influence 
their behavior, as well as for suggesting ways to improve them. Although 
asymptotic analysis is a crucial first step toward understanding the behavior, 
finite sample error bounds are of more value as they allow the design of model 
selection (or parameter tuning) procedures. These error bounds typically 
have the following form: with high probability, the error of the estimator 
(typically a function in a certain class) is bounded by an empirical estimate 
of error plus a penalty term depending on the complexity of the class of 
functions that can be chosen by the algorithm. The differences between the 
true and empirical errors of functions in that class can be viewed as an 
empirical process. Many tools have been developed for understanding the 
behavior of such objects, and especially for evaluating their suprema — which 
can be thought of as a measure of how hard it is to estimate functions in 
the class at hand. The goal is thus to obtain the sharpest possible estimates 
on the complexity of function classes. A problem arises since the notion of
complexity might depend on the (unknown) underlying probability measure
according to which the data is produced. Distribution-free notions of the
complexity, such as the Vapnik-Chervonenkis dimension [35] or the metric
entropy [28], typically give conservative estimates. Distribution-dependent
estimates, based for example on entropy numbers in the $L_2(P)$ distance,
where $P$ is the underlying distribution, are not useful when $P$ is unknown.
Thus, it is desirable to obtain data-dependent estimates which can readily 
be computed from the sample. 

One of the most interesting data-dependent complexity estimates is the 
so-called Rademacher average associated with the class. Although known for 
a long time to be related to the expected supremum of the empirical process 
(thanks to symmetrization inequalities), it was first proposed as an effective 
complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi 
[1] and Mendelson [25] and then further studied in [3]. Unfortunately, one 
of the shortcomings of the Rademacher averages is that they provide global 
estimates of the complexity of the function class, that is, they do not reflect 
the fact that the algorithm will likely pick functions that have a small error, 
and in particular, only a small subset of the function class will be used. As 
a result, the best error rate that can be obtained via the global Rademacher 
averages is at least of the order of $1/\sqrt{n}$ (where $n$ is the sample size), which
is suboptimal in some situations. Indeed, the type of algorithms we consider 
here are known in the statistical literature as M-estimators. They minimize 
an empirical loss criterion in a fixed class of functions. They have been 
extensively studied and their rate of convergence is known to be related 
to the modulus of continuity of the empirical process associated with the 
class of functions (rather than to the expected supremum of that empirical 
process). This modulus of continuity is well understood from the empirical 
processes theory viewpoint (see, e.g., [33, 34]). Also, from the point of view 
of M-estimators, the quantity which determines the rate of convergence is 
actually a fixed point of this modulus of continuity. Results of this type have 
been obtained by van de Geer [31, 32] (among others), who also provides 
nonasymptotic exponential inequalities. Unfortunately, these are in terms 
of entropy (or random entropy) and hence might not be useful when the 
probability distribution is unknown. 

The key property that allows one to prove fast rates of convergence is 
the fact that around the best function in the class, the variance of the
increments of the empirical process [or the $L_2(P)$ distance to the best function] is
upper bounded by a linear function of the expectation of these increments. 
In the context of regression with squared loss, this happens as soon as the 
functions are bounded and the class of functions is convex. In the context of 
classification, Mammen and Tsybakov have shown [20] that this also hap- 
pens under conditions on the conditional distribution (especially about its 
behavior around 1/2). They actually do not require the relationship between 
variance and expectation (of the increments) to be linear but allow for more
general, power type inequalities. Their results, like those of van de Geer, are
asymptotic. 

In order to exploit this key property and have finite sample bounds, rather 
than considering the Rademacher averages of the entire class as the complex- 
ity measure, it is possible to consider the Rademacher averages of a small 
subset of the class, usually the intersection of the class with a ball centered 
at a function of interest. These local Rademacher averages can serve as a 
complexity measure; clearly, they are always smaller than the corresponding 
global averages. Several authors have considered the use of local estimates 
of the complexity of the function class in order to obtain better bounds. 
Before presenting their results, we introduce some notation which is used 
throughout the paper. 

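Let $(\mathcal{X}, P)$ be a probability space. Denote by $\mathcal{F}$ a class of
measurable functions from $\mathcal{X}$ to $\mathbb{R}$, and set $X_1, \ldots, X_n$ to be
independent random variables distributed according to $P$. Let
$\sigma_1, \ldots, \sigma_n$ be $n$ independent Rademacher random variables, that is,
independent random variables for which
$\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$.

For a function $f : \mathcal{X} \to \mathbb{R}$ define

\[
P_n f = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \qquad
Pf = \mathbb{E} f(X), \qquad
R_n f = \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i).
\]

For a class $\mathcal{F}$, set

\[
R_n \mathcal{F} = \sup_{f \in \mathcal{F}} R_n f.
\]

Define $\mathbb{E}_\sigma$ to be the expectation with respect to the random variables
$\sigma_1, \ldots, \sigma_n$, conditioned on all of the other random variables. The
Rademacher average of $\mathcal{F}$ is $\mathbb{E} R_n \mathcal{F}$, and the empirical (or
conditional) Rademacher averages of $\mathcal{F}$ are

\[
\mathbb{E}_\sigma R_n \mathcal{F}
  = \frac{1}{n} \mathbb{E}\Big( \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i f(X_i)
    \,\Big|\, X_1, \ldots, X_n \Big).
\]

Some classical properties of Rademacher averages and some simple lemmas
(which we use often) are listed in Appendix A.1.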
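
To make the definition concrete, here is a minimal Python sketch (ours, not part of the original text) of a Monte Carlo estimate of the empirical Rademacher average for a finite class. It assumes the class is given as a list `values` whose row j holds $f_j(X_1), \ldots, f_j(X_n)$; the function name, the number of sign draws and the seed are illustrative choices.

```python
import random

def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(X_i).

    values: list of rows; row j holds f_j(X_1), ..., f_j(X_n) for a finite
            class of functions (an assumption of this sketch).
    """
    n = len(values[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        # One draw of n independent Rademacher signs.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Supremum over the class of the signed empirical mean R_n f.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / n for row in values
        )
    return total / num_draws
```

For a class containing a single function the estimate should be close to zero, and it can only grow as functions are added, since the supremum is then taken over a larger set.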

The simplest way to obtain the property allowing for fast rates of
convergence is to consider nonnegative uniformly bounded functions (or increments
with respect to a fixed null function). In this case, one trivially has for all
$f \in \mathcal{F}$, $\mathrm{Var}[f] \le c\,Pf$. This is exploited by Koltchinskii and
Panchenko [16], who consider the case of prediction with absolute loss when
functions in $\mathcal{F}$ have values in $[0, 1]$ and there is a perfect function $f^*$
in the class, that is, $Pf^* = 0$. They introduce an iterative method involving
local empirical Rademacher averages. They first construct a function
$\phi_n(r) = c_1 R_n\{f : P_n f \le 2r\} + c_2 \sqrt{rx/n} + c_3/n$, which can be computed
from the data. For $\hat{r}_N$ defined by $\hat{r}_0 = 1$,
$\hat{r}_{k+1} = \phi_n(\hat{r}_k)$, they show that with probability at least
$1 - 2Ne^{-x}$,

\[
P\hat{f} \le \hat{r}_N + \frac{2x}{n},
\]

where $\hat{f}$ is a minimizer of the empirical error, that is, a function in
$\mathcal{F}$ satisfying $P_n \hat{f} = \inf_{f \in \mathcal{F}} P_n f$. Hence, this
nonincreasing sequence of local Rademacher averages can be used as upper bounds on
the error of the empirical minimizer $\hat{f}$. Furthermore, if $\psi_n$ is a concave
function such that
$\psi_n(\sqrt{r}) \ge \mathbb{E}_\sigma R_n\{f \in \mathcal{F} : P_n f \le r\}$, and if the
number of iterations $N$ is at least $1 + \lceil \log_2 \log_2 n/x \rceil$, then with
probability at least $1 - Ne^{-x}$,

\[
\hat{r}_N \le c\Big( \hat{r}^* + \frac{x}{n} \Big),
\]

where $\hat{r}^*$ is a solution of the fixed-point equation $\psi_n(\sqrt{r}) = r$.
Combining the above results, one has a procedure to obtain data-dependent error
bounds that are of the order of the fixed point of the modulus of continuity at 0
of the empirical Rademacher averages. One limitation of this result is that it
assumes that there is a function $f^*$ in the class with $Pf^* = 0$. In contrast, we
are interested in prediction problems where $Pf$ is the error of an estimator,
and in the presence of noise there may not be any perfect estimator (even
the best in the class can have nonzero error).
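
The iteration $\hat{r}_{k+1} = \phi_n(\hat{r}_k)$ described above can be sketched as follows for a finite class. This is only an illustration under stated assumptions: the constants `c1`, `c2`, `c3` are placeholders set to 1 (the text does not give their values), the class is a list of rows of function values in $[0, 1]$, and the first row is the all-zero "perfect" function, so the local subclass is never empty. All names are ours.

```python
import math
import random

def phi_n(r, values, sigma, x, c1=1.0, c2=1.0, c3=1.0):
    """phi_n(r) = c1 * R_n{f : P_n f <= 2r} + c2 * sqrt(r x / n) + c3 / n.

    values: rows (f_j(X_1), ..., f_j(X_n)) with entries in [0, 1].
    sigma:  one fixed draw of n Rademacher signs.
    c1, c2, c3: placeholder constants (hypothetical values).
    """
    n = len(sigma)
    # Local subclass: functions whose empirical mean P_n f is at most 2r.
    local = [row for row in values if sum(row) / n <= 2.0 * r]
    # R_n of the local subclass for this draw of signs (0 if it is empty).
    local_rn = max(
        (sum(s * v for s, v in zip(sigma, row)) / n for row in local),
        default=0.0,
    )
    return c1 * local_rn + c2 * math.sqrt(r * x / n) + c3 / n

def iterate_radii(values, sigma, x, num_iter):
    """Return [r_0, ..., r_N] with r_0 = 1 and r_{k+1} = phi_n(r_k)."""
    radii = [1.0]
    for _ in range(num_iter):
        radii.append(phi_n(radii[-1], values, sigma, x))
    return radii

# Toy data: a perfect function (all zeros) plus 20 noisy functions in [0, 1].
rng = random.Random(0)
n = 2000
values = [[0.0] * n] + [[rng.random() for _ in range(n)] for _ in range(20)]
sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
radii = iterate_radii(values, sigma, x=1.0, num_iter=5)
```

Because $\phi_n$ is nondecreasing in $r$, the sequence is nonincreasing as soon as $\phi_n(1) \le 1$, which is what one expects for moderately large $n$.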

More recently, Bousquet, Koltchinskii and Panchenko [9] have obtained a
more general result avoiding the iterative procedure. Their result is that for
functions with values in $[0, 1]$, with probability at least $1 - e^{-x}$,

\[
\forall f \in \mathcal{F} \quad
Pf \le c\Big( P_n f + \hat{r}^* + \frac{x + \log\log n}{n} \Big),
\tag{1.1}
\]

where $\hat{r}^*$ is the fixed point of a concave function $\psi_n$ satisfying
$\psi_n(0) = 0$ and

\[
\psi_n(\sqrt{r}) \ge \mathbb{E}_\sigma R_n\{f \in \mathcal{F} : P_n f \le r\}.
\]

The main difference between this and the results of [16] is that there is no
requirement that the class contain a perfect function. However, the local
Rademacher averages are centered around the zero function instead of the
one that minimizes $Pf$. As a consequence, the fixed point $\hat{r}^*$ cannot be
expected to converge to zero when $\inf_{f \in \mathcal{F}} Pf > 0$.

In order to remove this limitation, Lugosi and Wegkamp [19] use localized
Rademacher averages of a small ball around the minimizer $\hat{f}$ of $P_n$. However,
their result is restricted to nonnegative functions, and in particular functions
with values in $\{0, 1\}$. Moreover, their bounds also involve some global
information, in the form of the shatter coefficients $S_{\mathcal{F}}(X_1^n)$ of the
function class (i.e., the cardinality of the coordinate projections of the class
$\mathcal{F}$ on the data $X_1^n$). They show that there are constants $c_1, c_2$ such
that, with probability at least $1 - 8/n$, the empirical minimizer $\hat{f}$ satisfies

\[
P\hat{f} \le \inf_{f \in \mathcal{F}} Pf + 2\hat{\psi}_n(\hat{r}_n),
\]

where

\[
\hat{\psi}_n(r) = c_1 \Big( \mathbb{E}_\sigma R_n\{f \in \mathcal{F} :
  P_n f \le 16 P_n \hat{f} + 15r\} + \frac{\log n}{n}
  + \sqrt{\frac{\log n}{n}} \sqrt{P_n \hat{f} + r} \Big)
\]

and $\hat{r}_n = c_2 (\log S_{\mathcal{F}}(X_1^n) + \log n)/n$. The limitation of this
result is that $\hat{r}_n$ has to be chosen according to the (empirically measured)
complexity of the whole class, which may not be as sharp as the Rademacher
averages, and in general, is not a fixed point of $\hat{\psi}_n$. Moreover, the balls
over which the Rademacher averages are computed in $\hat{\psi}_n$ contain a factor of
16 in front of $P_n \hat{f}$. As we explain later, this induces a lower bound on
$\hat{\psi}_n$ when there is no function with $Pf = 0$ in the class.

It seems that the only way to capture the right behavior in the general,
noisy case is to analyze the increments of the empirical process, in other
words, to directly consider the functions $f - f^*$. This approach was first
proposed by Massart [22]; see also [26]. Massart introduces the assumption

\[
\mathrm{Var}[\ell_f(X) - \ell_{f^*}(X)] \le d^2(f, f^*) \le B(P\ell_f - P\ell_{f^*}),
\]

where $\ell_f$ is the loss associated with the function $f$ [in other words,
$\ell_f(X, Y) = \ell(f(X), Y)$, which measures the discrepancy in the prediction made
by $f$], $d$ is a pseudometric and $f^*$ minimizes the expected loss. (The previous
results could also be stated in terms of loss functions, but we omitted this in
order to simplify exposition. However, the extra notation is necessary to properly
state Massart's result.) This is a more refined version of the assumption we
mentioned earlier on the relationship between the variance and expectation
of the increments of the empirical process. It is only satisfied for some loss
functions $\ell$ and function classes $\mathcal{F}$. Under this assumption, Massart
considers a nondecreasing function $\psi$ satisfying

\[
\psi(r) \ge \mathbb{E} \sup_{f \in \mathcal{F},\, d^2(f, f^*) \le r}
  |Pf - Pf^* - P_n f + P_n f^*| + c\,\frac{x}{n},
\]

such that $\psi(r)/\sqrt{r}$ is nonincreasing (we refer to this property as the
sub-root property later in the paper). Then, with probability at least $1 - e^{-x}$,

\[
\forall f \in \mathcal{F} \quad
P\ell_f - P\ell_{f^*} \le c\Big( r^* + \frac{x}{n} \Big),
\tag{1.2}
\]

where $r^*$ is the fixed point of $\psi$ and $c$ depends only on $B$ and on the
uniform bound on the range of functions in $\mathcal{F}$. It can be proved that in many
situations of interest, this bound suffices to prove minimax rates of
convergence for penalized M-estimators. (Massart considers examples where the
complexity term can be bounded using a priori global information about the
function class.) However, the main limitation of this result is that it does
not involve quantities that can be computed from the data.

Finally, as we mentioned earlier, Mendelson [26] gives an analysis similar 
to that of Massart, in a slightly less general case (with no noise in the target 
values, i.e., the conditional distribution of Y given X is concentrated at 
one point). Mendelson introduces the notion of the star-hull of a class of 
functions (see the next section for a definition) and considers Rademacher 
averages of this star-hull as a localized measure of complexity. His results 
also involve a priori knowledge of the class, such as the rate of growth of 
covering numbers. 

We can now spell out our goal in more detail: in this paper we com- 
bine the increment-based approach of Massart and Mendelson (dealing with 
differences of functions, or more generally with bounded real-valued func- 
tions) with the empirical local Rademacher approach of Koltchinskii and 
Panchenko and of Lugosi and Wegkamp, in order to obtain data-dependent 
bounds which depend on a fixed point of the modulus of continuity of 
Rademacher averages computed around the empirically best function. 

Our first main result (Theorem 3.3) is a distribution-dependent result 
involving the fixed point r* of a local Rademacher average of the star-hull 
of the class $\mathcal{F}$. This shows that functions with the sub-root property can
readily be obtained from Rademacher averages, while in previous work the 
appropriate functions were obtained only via global information about the 
class. 

The second main result (Theorems 4.1 and 4.2) is an empirical counterpart 
of the first one, where the complexity is the fixed point of an empirical local 
Rademacher average. We also show that this fixed point is within a constant 
factor of the nonempirical one. 

Equipped with this result, we can then prove (Theorem 5.4) a fully data- 
dependent analogue of Massart's result, where the Rademacher averages are 
localized around the minimizer of the empirical loss. 

We also show (Theorem 6.3) that in the context of classification, the 
local Rademacher averages of star-hulls can be approximated by solving a 
weighted empirical error minimization problem. 

Our final result (Corollary 6.7) concerns regression with kernel classes, 
that is, classes of functions that are generated by a positive definite ker- 
nel. These classes are widely used in interpolation and estimation problems 
as they yield computationally efficient algorithms. Our result gives a data- 
dependent complexity term that can be computed directly from the eigen- 
values of the Gram matrix (the matrix whose entries are values of the kernel 
on the data).

The sharpness of our results is demonstrated by the fact that we recover,
in the distribution-dependent case (treated in Section 4), similar results 
to those of Massart [22], which, in the situations where they apply, give 
the minimax optimal rates or the best known results. Moreover, the data- 
dependent bounds that we obtain as counterparts of these results have the 
same rate of convergence (see Theorem 4.2). 

The paper is organized as follows. In Section 2 we present some prelimi- 
nary results obtained from concentration inequalities, which we use through- 
out. Section 3 establishes error bounds using local Rademacher averages and 
explains how to compute their fixed points from "global information" (e.g., 
estimates of the metric entropy or of the combinatorial dimensions of the 
indexing class), in which case the optimal estimates can be recovered. In 
Section 4 we give a data-dependent error bound using empirical and local 
Rademacher averages, and show the connection between the fixed points of 
the empirical and nonempirical Rademacher averages. In Section 5 we ap- 
ply our results to loss classes. We give estimates that generalize the results 
of Koltchinskii and Panchenko by eliminating the requirement that some 
function in the class have zero loss, and are more general than those of 
Lugosi and Wegkamp, since there is no need in our case to estimate
global shatter coefficients of the class. We also give a data-dependent exten- 
sion of Massart's result where the local averages are computed around the 
minimizer of the empirical loss. Finally, Section 6 shows that the problem 
of estimating these local Rademacher averages in classification reduces to 
weighted empirical risk minimization. It also shows that the local averages 
for kernel classes can be sharply bounded in terms of the eigenvalues of the 
Gram matrix. 

2. Preliminary results. Recall that the star-hull of $\mathcal{F}$ around $f_0$ is
defined by

\[
\mathrm{star}(\mathcal{F}, f_0) = \{f_0 + \alpha(f - f_0) : f \in \mathcal{F},\ \alpha \in [0, 1]\}.
\]

Throughout this paper, we will manipulate suprema of empirical processes,
that is, quantities of the form $\sup_{f \in \mathcal{F}} (Pf - P_n f)$. We will always
assume they are measurable without explicitly mentioning it. In other words, we
assume that the class $\mathcal{F}$ and the distribution $P$ satisfy appropriate (mild)
conditions for measurability of this supremum (we refer to [11, 28] for a
detailed account of such issues).

The following theorem is the main result of this section and is at the core 
of all the proofs presented later. It shows that if the functions in a class 
have small variance, the maximal deviation between empirical means and 
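true means is controlled by the Rademacher averages of $\mathcal{F}$. In particular,
the bound improves as the largest variance of a class member decreases.

Theorem 2.1. Let $\mathcal{F}$ be a class of functions that map $\mathcal{X}$ into $[a, b]$.
Assume that there is some $r > 0$ such that for every $f \in \mathcal{F}$,
$\mathrm{Var}[f(X_i)] \le r$. Then, for every $x > 0$, with probability at least
$1 - e^{-x}$,

\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le \inf_{\alpha > 0} \Big(
  2(1 + \alpha)\, \mathbb{E} R_n \mathcal{F} + \sqrt{\frac{2rx}{n}}
  + (b - a)\Big( \frac{1}{3} + \frac{1}{\alpha} \Big) \frac{x}{n} \Big),
\]

and with probability at least $1 - 2e^{-x}$,

\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le \inf_{\alpha \in (0, 1)} \Big(
  2\,\frac{1 + \alpha}{1 - \alpha}\, \mathbb{E}_\sigma R_n \mathcal{F} + \sqrt{\frac{2rx}{n}}
  + (b - a)\Big( \frac{1}{3} + \frac{1}{\alpha}
  + \frac{1 + \alpha}{2\alpha(1 - \alpha)} \Big) \frac{x}{n} \Big).
\]

Moreover, the same results hold for the quantity
$\sup_{f \in \mathcal{F}} (P_n f - Pf)$.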

This theorem, which is proved in Appendix A.2, is a more or less direct
consequence of Talagrand's inequality for empirical processes [30]. However,
the actual statement presented here is new in the sense that it displays the
best known constants. Indeed, compared to the previous result of Koltchinskii
and Panchenko [16] which was based on Massart's version of Talagrand's
inequality [21], we have used the most refined concentration inequalities
available: that of Bousquet [7] for the supremum of the empirical process
and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages.
This last inequality is a powerful tool to obtain data-dependent bounds,
since it allows one to replace the Rademacher average (which measures the
complexity of the class of functions) by its empirical version, which can be
efficiently computed in some cases. Details about these inequalities are given
in Appendix A.1.

When applied to the full function class $\mathcal{F}$, the above theorem is not useful.
Indeed, with only a trivial bound on the maximal variance, better results
can be obtained via simpler concentration inequalities, such as the bounded
difference inequality [23], which would allow $\sqrt{rx/n}$ to be replaced by
$\sqrt{x/n}$. However, by applying Theorem 2.1 to subsets of $\mathcal{F}$ or to modified
classes obtained from $\mathcal{F}$, much better results can be obtained. Hence, the
presence of an upper bound on the variance in the square root term is the key
ingredient of this result.
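
As a numerical illustration of how the variance bound $r$ enters, the Python sketch below evaluates the right-hand side of the first inequality of Theorem 2.1, read here as $2(1+\alpha)\mathbb{E}R_n\mathcal{F} + \sqrt{2rx/n} + (b-a)(1/3 + 1/\alpha)x/n$, together with the value of $\alpha$ that minimizes it. The function and argument names are ours, and the closed form for the minimizer comes from elementary calculus rather than from the paper.

```python
import math

def theorem_2_1_bound(rad_avg, var_bound, x, n, a, b, alpha):
    """Right-hand side of the first bound for a given alpha > 0.

    rad_avg:   the Rademacher average E R_n F (assumed known or estimated).
    var_bound: the uniform variance bound r.
    """
    return (2.0 * (1.0 + alpha) * rad_avg
            + math.sqrt(2.0 * var_bound * x / n)
            + (b - a) * (1.0 / 3.0 + 1.0 / alpha) * x / n)

def best_alpha(rad_avg, x, n, a, b):
    """Minimizer over alpha > 0 of the bound above (requires rad_avg > 0).

    Setting the derivative 2*rad_avg - (b - a)*x / (n*alpha**2) to zero
    gives alpha = sqrt((b - a) * x / (2 * n * rad_avg)).
    """
    return math.sqrt((b - a) * x / (2.0 * n * rad_avg))
```

With the other quantities held fixed, a smaller variance bound only shrinks the $\sqrt{2rx/n}$ term, which is the term that localization is designed to control.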

A last preliminary result that we will require is the following consequence
of Theorem 2.1, which shows that if the local Rademacher averages are small,
then balls in $L_2(P)$ are probably contained in the corresponding empirical
balls [i.e., in $L_2(P_n)$] with a slightly larger radius.

Corollary 2.2. Let $\mathcal{F}$ be a class of functions that map $\mathcal{X}$ into
$[-b, b]$ with $b > 0$. For every $x > 0$ and $r$ that satisfy

\[
r \ge 10b\, \mathbb{E} R_n\{f : f \in \mathcal{F},\ Pf^2 \le r\} + \frac{11 b^2 x}{n},
\]

then with probability at least $1 - e^{-x}$,

\[
\{f \in \mathcal{F} : Pf^2 \le r\} \subseteq \{f \in \mathcal{F} : P_n f^2 \le 2r\}.
\]

Proof. Since the range of any function in the set
$\mathcal{F}_r = \{f^2 : f \in \mathcal{F},\ Pf^2 \le r\}$ is contained in $[0, b^2]$, it
follows that $\mathrm{Var}[f^2(X_i)] \le Pf^4 \le b^2 Pf^2 \le b^2 r$. Thus, by the first
part of Theorem 2.1 (with $\alpha = 1/4$), with probability at least $1 - e^{-x}$,
every $f \in \mathcal{F}_r$ satisfies

\begin{align*}
P_n f &\le r + \frac{5}{2}\, \mathbb{E} R_n\{f^2 : f \in \mathcal{F},\ Pf^2 \le r\}
        + \sqrt{\frac{2 b^2 r x}{n}} + \frac{13 b^2 x}{3n} \\
      &\le r + 5b\, \mathbb{E} R_n\{f : f \in \mathcal{F},\ Pf^2 \le r\}
        + \frac{r}{2} + \frac{16 b^2 x}{3n} \\
      &\le 2r,
\end{align*}

where the second inequality follows from Lemma A.3 and we have used,
in the second to last inequality, Theorem A.6 applied to $\phi(x) = x^2$ (with
Lipschitz constant $2b$ on $[-b, b]$). □



3. Error bounds with local complexity. In this section we show that 
the Rademacher averages associated with a small subset of the class may 
be considered as a complexity term in an error bound. Since these local 
Rademacher averages are always smaller than the corresponding global av- 
erages, they lead to sharper bounds. 

We present a general error bound involving local complexities that is
applicable to classes of bounded functions for which the variance is bounded by
a fixed linear function of the expectation. In this case the local Rademacher
averages are defined as $\mathbb{E} R_n\{f \in \mathcal{F} : T(f) \le r\}$ where $T(f)$ is
an upper bound on the variance [typically chosen as $T(f) = Pf^2$].

There is a trade-off between the size of the subset we consider in these 
local averages and its complexity; we shall see that the optimal choice is 
given by a fixed point of an upper bound on the local Rademacher averages. 
The functions we use as upper bounds are sub-root functions; among other 
useful properties, sub-root functions have a unique fixed point. 

Definition 3.1. A function $\psi : [0, \infty) \to [0, \infty)$ is sub-root if it is
nonnegative, nondecreasing and if $r \mapsto \psi(r)/\sqrt{r}$ is nonincreasing for
$r > 0$.

We only consider nontrivial sub-root functions, that is, sub-root functions
that are not the constant function $\psi \equiv 0$.

Lemma 3.2. If $\psi : [0, \infty) \to [0, \infty)$ is a nontrivial sub-root function,
then it is continuous on $[0, \infty)$ and the equation $\psi(r) = r$ has a unique
positive solution. Moreover, if we denote the solution by $r^*$, then for all
$r > 0$, $r \ge \psi(r)$ if and only if $r^* \le r$.

The proof of this lemma is in Appendix A.2. In view of the lemma, we will
simply refer to the quantity $r^*$ as the unique positive solution of $\psi(r) = r$,
or as the fixed point of $\psi$.
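
Lemma 3.2 suggests a simple numerical procedure: since $r \ge \psi(r)$ exactly when $r \ge r^*$, the fixed point can be bracketed and then located by bisection. The Python sketch below is an illustration only; the function name, starting bracket and tolerance are our own choices, and it assumes that `psi` is a nontrivial sub-root function.

```python
import math

def sub_root_fixed_point(psi, tol=1e-12, max_iter=200):
    """Approximate the unique positive solution r* of psi(r) = r.

    Relies on the characterization r >= psi(r) if and only if r >= r*.
    """
    hi = 1.0
    # Grow hi until psi(hi) <= hi. This terminates because psi(r)/sqrt(r)
    # is nonincreasing, so psi(r) <= psi(1) * sqrt(r) for r >= 1.
    while psi(hi) > hi:
        hi *= 2.0
    lo = 0.0  # invariant: lo < r* <= hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if psi(mid) <= mid:
            hi = mid   # mid is at or above the fixed point
        else:
            lo = mid   # mid is below the fixed point
        if hi - lo <= tol * hi:
            break
    return hi
```

For instance, $\psi(r) = A\sqrt{r}$ with $A > 0$ is sub-root and solving $A\sqrt{r} = r$ gives $r^* = A^2$, so `sub_root_fixed_point(lambda r: 0.3 * math.sqrt(r))` should return a value close to 0.09. In particular, when $A$ is of order $1/\sqrt{n}$ the fixed point is of order $1/n$.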

3.1. Error bounds. We can now state and discuss the main result of this 
section. It is composed of two parts: in the first part, one requires a sub-root 
upper bound on the local Rademacher averages, and in the second part, it 
is shown that better results can be obtained when the class over which the 
averages are computed is enlarged slightly. 

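Theorem 3.3. Let $\mathcal{F}$ be a class of functions with ranges in $[a, b]$ and
assume that there are some functional $T : \mathcal{F} \to \mathbb{R}^+$ and some constant
$B$ such that for every $f \in \mathcal{F}$, $\mathrm{Var}[f] \le T(f) \le B\,Pf$. Let $\psi$
be a sub-root function and let $r^*$ be the fixed point of $\psi$.

1. Assume that $\psi$ satisfies, for any $r \ge r^*$,

\[
\psi(r) \ge B\, \mathbb{E} R_n\{f \in \mathcal{F} : T(f) \le r\}.
\]

Then, with $c_1 = 704$ and $c_2 = 26$, for any $K > 1$ and every $x > 0$, with
probability at least $1 - e^{-x}$,

\[
\forall f \in \mathcal{F} \quad
Pf \le \frac{K}{K - 1} P_n f + \frac{c_1 K}{B} r^* + \frac{x(11(b - a) + c_2 BK)}{n}.
\]

Also, with probability at least $1 - e^{-x}$,

\[
\forall f \in \mathcal{F} \quad
P_n f \le \frac{K + 1}{K} Pf + \frac{c_1 K}{B} r^* + \frac{x(11(b - a) + c_2 BK)}{n}.
\]

2. If, in addition, for $f \in \mathcal{F}$ and $\alpha \in [0, 1]$,
$T(\alpha f) \le \alpha^2 T(f)$, and if $\psi$ satisfies, for any $r \ge r^*$,

\[
\psi(r) \ge B\, \mathbb{E} R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : T(f) \le r\},
\]

then the same results hold true with $c_1 = 6$ and $c_2 = 5$.

The proof of this theorem is given in Section 3.2.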

We can compare the results to our starting point (Theorem 2.1). The
improvement comes from the fact that the complexity term, which was
essentially $\sup_r \psi(r)$ in Theorem 2.1 (if we had applied it to the class
$\mathcal{F}$ directly) is now reduced to $r^*$, the fixed point of $\psi$. So the
complexity term is always smaller (later, we show how to estimate $r^*$). On the
other hand, there is some loss since the constant in front of $P_n f$ is strictly
larger than 1. Section 5.2 will show that this is not an issue in the
applications we have in mind.

In Sections 5.1 and 5.2 we investigate conditions that ensure the assump- 
tions of this theorem are satisfied, and we provide applications of this result 
to prediction problems. The condition that the variance is upper bounded 
by the expectation turns out to be crucial to obtain these results. 

The idea behind Theorem 3.3 originates in the work of Massart [22], who 
proves a slightly different version of the first part. The difference is that we 
use local Rademacher averages instead of the expectation of the supremum 
of the empirical process on a ball. Moreover, we give smaller constants. As 
far as we know, the second part of Theorem 3.3 is new. 

3.1.1. Choosing the function $\psi$. Notice that the function $\psi$ cannot be
chosen arbitrarily and has to satisfy the sub-root property. One possible
approach is to use classical upper bounds on the Rademacher averages, such
as Dudley's entropy integral. This can give a sub-root upper bound and was
used, for example, in [16] and in [22].

However, the second part of Theorem 3.3 indicates a possible choice for
$\psi$, namely, one can take $\psi$ as the local Rademacher averages of the
star-hull of $\mathcal{F}$ around 0. The reason for this comes from the following lemma,
which shows that if the class is star-shaped and $T(f)$ behaves as a quadratic
function, the Rademacher averages are sub-root.

Lemma 3.4. If the class $\mathcal{F}$ is star-shaped around $\hat{f}$ (which may depend
on the data), and $T : \mathcal{F} \to \mathbb{R}^+$ is a (possibly random) function that
satisfies $T(\alpha f) \le \alpha^2 T(f)$ for any $f \in \mathcal{F}$ and any
$\alpha \in [0, 1]$, then the (random) function $\psi$ defined for $r \ge 0$ by

\[
\psi(r) = \mathbb{E}_\sigma R_n\{f \in \mathcal{F} : T(f - \hat{f}) \le r\}
\]

is sub-root and $r \mapsto \mathbb{E}\psi(r)$ is also sub-root.

This lemma is proved in Appendix A.2.

Notice that making a class star-shaped only increases it, so that

\[
\mathbb{E} R_n\{f \in \mathrm{star}(\mathcal{F}, f_0) : T(f) \le r\}
  \ge \mathbb{E} R_n\{f \in \mathcal{F} : T(f) \le r\}.
\]

However, this increase in size is moderate as can be seen, for example, if 
one compares covering numbers of a class and its star-hull (see, e.g., [26], 
Lemma 4.5). 
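
To illustrate the sub-root behaviour of star-hull averages, the Python sketch below estimates the empirical local Rademacher average of $\mathrm{star}(\mathcal{F}, 0)$ for a finite base class. It is an illustration under our own assumptions: the localization uses the empirical second moment $T(f) = P_n f^2$ (which satisfies $T(\alpha f) = \alpha^2 T(f)$), the class is given as rows of function values, and the expectation over signs is replaced by a Monte Carlo average. For each $f$, the admissible scalings are $\alpha \in [0, \alpha_{\max}]$ with $\alpha_{\max} = \min(1, \sqrt{r / P_n f^2})$, so the supremum over the star-hull reduces to a maximum over the base class.

```python
import math
import random

def local_star_rademacher(values, r, num_draws=500, seed=0):
    """Estimate E_sigma R_n{g in star(F, 0) : P_n g^2 <= r} for a finite F.

    values: list of rows; row j holds f_j(X_1), ..., f_j(X_n).
    """
    n = len(values[0])
    # alpha_max[j]: largest alpha in [0, 1] with alpha^2 * P_n f_j^2 <= r.
    alpha_max = []
    for row in values:
        second_moment = sum(v * v for v in row) / n
        if second_moment <= r:
            alpha_max.append(1.0)
        else:
            alpha_max.append(math.sqrt(r / second_moment))
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        best = 0.0  # alpha = 0 (the zero function) is always feasible
        for a_max, row in zip(alpha_max, values):
            corr = sum(s * v for s, v in zip(sigma, row)) / n
            best = max(best, a_max * corr)
        total += best
    return total / num_draws
```

With a fixed seed, the resulting function of $r$ is nondecreasing and its ratio to $\sqrt{r}$ is nonincreasing, which is the sub-root property stated in Lemma 3.4.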
3.1.2. Some consequences. As a consequence of Theorem 3.3, we obtain 
an error bound when $\mathcal{F}$ consists of uniformly bounded nonnegative functions.
Notice that in this case the variance is trivially bounded by a constant times
the expectation and one can directly use $T(f) = Pf$.

Corollary 3.5. Let $\mathcal{F}$ be a class of functions with ranges in $[0, 1]$. Let
$\psi$ be a sub-root function, such that for all $r \ge 0$,

\[
\mathbb{E} R_n\{f \in \mathcal{F} : Pf \le r\} \le \psi(r),
\]

and let $r^*$ be the fixed point of $\psi$. Then, for any $K > 1$ and every $x > 0$,
with probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$ satisfies

\[
Pf \le \frac{K}{K - 1} P_n f + 704 K r^* + \frac{x(11 + 26K)}{n}.
\]

Also, with probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$ satisfies

\[
P_n f \le \frac{K + 1}{K} Pf + 704 K r^* + \frac{x(11 + 26K)}{n}.
\]

Proof. When $f \in [0, 1]$, we have $\mathrm{Var}[f] \le Pf$ so that the result follows
from applying Theorem 3.3 with $T(f) = Pf$. □

We also note that the same idea as in the proof of Theorem 3.3 gives a 
converse of Corollary 2.2, namely, that with high probability the intersection 
of $\mathcal{F}$ with an empirical ball of a fixed radius is contained in the
intersection of $\mathcal{F}$ with an $L_2(P)$ ball with a slightly larger radius.

Lemma 3.6. Let $\mathcal{F}$ be a class of functions that map $\mathcal{X}$ into $[-1, 1]$.
Fix $x > 0$. If

\[
r \ge 20\, \mathbb{E} R_n\{f : f \in \mathrm{star}(\mathcal{F}, 0),\ Pf^2 \le r\} + \frac{26x}{n},
\]

then with probability at least $1 - e^{-x}$,

\[
\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le r\}
  \subseteq \{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le 2r\}.
\]

This result, proved in Section 3.2, will be useful in Section 4. 

3.1.3. Estimating r* from global information. The error bounds involve 
fixed points of functions that define upper bounds on the local Rademacher 
averages. In some cases these fixed points can be estimated from global 
information on the function class. We present a complete analysis only in 
a simple case, where $\mathcal{F}$ is a class of binary-valued functions with a finite
VC-dimension.

Corollary 3.7. Let $\mathcal{F}$ be a class of $\{0, 1\}$-valued functions with
VC-dimension $d < \infty$. Then for all $K > 1$ and every $x > 0$, with probability
at least $1 - e^{-x}$, every $f \in \mathcal{F}$ satisfies

\[
Pf \le \frac{K}{K - 1} P_n f + cK\Big( \frac{d \log(n/d)}{n} + \frac{x}{n} \Big).
\]

The proof is in Appendix A.2.

The above result is similar to results obtained by Vapnik and Chervo- 
nenkis [35] and by Lugosi and Wegkamp (Theorem 3.1 of [19]). However, 
they used inequalities for weighted empirical processes indexed by nonnega- 
tive functions. Our results have more flexibility since they can accommodate 
general functions, although this is not needed in this simple corollary. 

The proof uses a similar line of reasoning to proofs in [26, 27]. Clearly, 
it extends to any class of real-valued functions for which one has estimates 
for the entropy integral, such as classes with finite pseudo-dimension or a 
combinatorial dimension that grows more slowly than quadratically. See [26, 
27] for more details. 

Notice also that the rate of $\log n / n$ is the best known.

3.1.4. Proof techniques. Before giving the proofs of the results mentioned 
above, let us sketch the techniques we use. The approach has its roots in 
classical empirical processes theory, where it was understood that the
modulus of continuity of the empirical process is an important quantity (here
$\psi$ plays this role). In order to obtain nonasymptotic results, two approaches
have been developed: the first one consists of cutting the class $\mathcal{F}$ into
smaller pieces, where one has control of the variance of the elements. This is the
so-called peeling technique (see, e.g., [31, 32, 33, 34] and references therein).
The second approach consists of weighting the functions in $\mathcal{F}$ by dividing
them by their variance. Many results have been obtained on such weighted
empirical processes (see, e.g., [28]). The results of Vapnik and Chervonenkis 
based on weighting [35] are restricted to classes of nonnegative functions. 
Also, most previous results, such as those of Pollard [28], van de Geer [32] 
or Haussler [13], give complexity terms that involve "global" measures of 
complexity of the class, such as covering numbers. None of these results uses 
the recently introduced Rademacher averages as measures of complexity. 
It turns out that it is possible to combine the peeling and weighting ideas 
with concentration inequalities to obtain such results, as proposed by Mas- 
sart in [22], and also used (for nonnegative functions) by Koltchinskii and 
Panchenko [16]. 

The idea is the following: 

(a) Apply Theorem 2.1 to the class of functions $\{f/w(f) : f \in \mathcal{F}\}$, where
$w$ is some nonnegative weight of the order of the variance of $f$. Hence, the
functions in this class have a small variance.
(b) Upper bound the Rademacher averages of this weighted class, by
"peeling off" subclasses of $\mathcal{F}$ according to the variance of their elements,
and bounding the Rademacher averages of these subclasses using $\psi$.

(c) Use the sub-root property of $\psi$, so that its fixed point gives a common
upper bound on the complexity of all the subclasses (up to some scaling).

(d) Finally, convert the upper bound for functions in the weighted class 
into a bound for functions in the initial class. 

The idea of peeling, that is, of partitioning the class $\mathcal{F}$ into slices where
functions have variance within a certain range, is at the core of the proof of
the first part of Theorem 3.3 [see, e.g., (3.1)]. However, it does not appear
explicitly in the proof of the second part. One explanation is that when one
considers the star-hull of the class, it is enough to consider two subclasses:
the functions with $T(f) \le r$ and the ones with $T(f) > r$, and this is done
by introducing the weighting factor $T(f) \vee r$. This idea was exploited in
the work of Mendelson [26] and, more recently, in [4]. Moreover, when one
considers the set $\mathcal{F}_r = \mathrm{star}(\mathcal{F}, 0) \cap \{T(f) \le r\}$, any function
$f' \in \mathcal{F}$ with $T(f') > r$ will have a scaled down representative in that set.
So even though it seems that we look at the class $\mathrm{star}(\mathcal{F}, 0)$ only locally,
we still take into account all of the functions in $\mathcal{F}$ (with appropriate scaling).

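3.2. Proofs. Before presenting the proof, let us first introduce some
additional notation. Given a class $\mathcal{F}$, $\lambda > 1$ and $r > 0$, let
$w(f) = \min\{r\lambda^k : k \in \mathbb{N},\ r\lambda^k \ge T(f)\}$ and set

$$\mathcal{G}_r = \left\{ \frac{r}{w(f)}\, f : f \in \mathcal{F} \right\}.$$

Notice that $w(f) \ge r$, so that $\mathcal{G}_r \subseteq \{\alpha f : f \in \mathcal{F}, \alpha \in [0,1]\} = \mathrm{star}(\mathcal{F}, 0)$. Define

$$V_r^+ = \sup_{g \in \mathcal{G}_r} Pg - P_n g \quad\text{and}\quad V_r^- = \sup_{g \in \mathcal{G}_r} P_n g - Pg.$$

For the second part of the theorem, we need to introduce another class of
functions,

$$\tilde{\mathcal{G}}_r = \left\{ \frac{r f}{T(f) \vee r} : f \in \mathcal{F} \right\},$$

and define

$$\tilde V_r^+ = \sup_{g \in \tilde{\mathcal{G}}_r} Pg - P_n g \quad\text{and}\quad \tilde V_r^- = \sup_{g \in \tilde{\mathcal{G}}_r} P_n g - Pg.$$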
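The peeling weight $w(f)$ depends on $f$ only through the value $T(f)$, so it can be illustrated with a few lines of code. The following is a minimal sketch, not part of the original argument; the function names `peeling_weight` and `rescale_factor` are hypothetical, and $k$ is taken to range over the nonnegative integers (so that $w(f) = r$ whenever $T(f) \le r$).

```python
def peeling_weight(t, r, lam):
    """Return w = min{ r * lam**k : k = 0, 1, 2, ..., r * lam**k >= t }.

    t plays the role of T(f); r > 0 and lam > 1 are the peeling parameters.
    """
    if r <= 0 or lam <= 1:
        raise ValueError("need r > 0 and lam > 1")
    w = r
    while w < t:  # climb the geometric grid r, r*lam, r*lam^2, ...
        w *= lam
    return w


def rescale_factor(t, r, lam):
    """Factor r / w(f), which lies in (0, 1], applied to f to build G_r."""
    return r / peeling_weight(t, r, lam)


# With r = 1 and lam = 4, a function with T(f) = 5 falls in the slice (4, 16].
factor = rescale_factor(5.0, 1.0, 4.0)
```

Since the variance scales quadratically, a function with $\mathrm{Var}[f] \le T(f) = 5$ is mapped to one with variance at most $5 \cdot (1/16)^2 \le r$, which is the property used in the proof of the first part of Theorem 3.3.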



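Lemma 3.8. With the above notation, assume that there is a constant
$B > 0$ such that for every $f \in \mathcal{F}$, $T(f) \le BPf$. Fix $K > 1$, $\lambda > 0$ and $r > 0$.
If $V_r^+ \le r/(\lambda BK)$, then

$$\forall f \in \mathcal{F} \qquad Pf \le \frac{K}{K-1} P_n f + \frac{r}{\lambda BK}.$$

Also, if $V_r^- \le r/(\lambda BK)$, then

$$\forall f \in \mathcal{F} \qquad P_n f \le \frac{K+1}{K} Pf + \frac{r}{\lambda BK}.$$

Similarly, if $K > 1$ and $r > 0$ are such that $\tilde V_r^+ \le r/(BK)$, then

$$\forall f \in \mathcal{F} \qquad Pf \le \frac{K}{K-1} P_n f + \frac{r}{BK}.$$

Also, if $\tilde V_r^- \le r/(BK)$, then

$$\forall f \in \mathcal{F} \qquad P_n f \le \frac{K+1}{K} Pf + \frac{r}{BK}.$$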

Proof. Notice that for all $g \in \mathcal{G}_r$, $Pg \le P_n g + V_r^+$. Fix $f \in \mathcal{F}$ and define
$g = r f / w(f)$. When $T(f) \le r$, $w(f) = r$, so that $g = f$. Thus, the fact that
$Pg \le P_n g + V_r^+$ implies that $Pf \le P_n f + V_r^+ \le P_n f + r/(\lambda BK)$.

On the other hand, if $T(f) > r$, then $w(f) = r\lambda^k$ with $k > 0$ and $T(f) \in
(r\lambda^{k-1}, r\lambda^k]$. Moreover, $g = f/\lambda^k$, $Pg \le P_n g + V_r^+$, and thus

$$\frac{Pf}{\lambda^k} \le \frac{P_n f}{\lambda^k} + V_r^+.$$

Using the fact that $T(f) > r\lambda^{k-1}$, it follows that

$$Pf \le P_n f + \lambda^k V_r^+ < P_n f + \lambda T(f) V_r^+ / r \le P_n f + Pf/K.$$

Rearranging,

$$Pf \le \frac{K}{K-1} P_n f < \frac{K}{K-1} P_n f + \frac{r}{\lambda BK}.$$

The proof of the second result is similar. For the third and fourth results,
the reasoning is the same. □

Proof of Theorem 3.3, first part. Let $\mathcal{G}_r$ be defined as above,
where $r$ is chosen such that $r \ge r^*$, and note that functions in $\mathcal{G}_r$ satisfy
$\|g - Pg\|_\infty \le b - a$ since $0 \le r/w(f) \le 1$. Also, we have $\mathrm{Var}[g] \le r$. Indeed, if
$T(f) \le r$, then $g = f$, and thus $\mathrm{Var}[g] = \mathrm{Var}[f] \le r$. Otherwise, when $T(f) >
r$, $g = f/\lambda^k$ (where $k$ is such that $T(f) \in (r\lambda^{k-1}, r\lambda^k]$), so that $\mathrm{Var}[g] =
\mathrm{Var}[f]/\lambda^{2k} \le r$.

Applying Theorem 2.1, for all $x > 0$, with probability $1 - e^{-x}$,

$$V_r^+ \le 2(1+\alpha)\,\mathbb{E}R_n\mathcal{G}_r + \sqrt{\frac{2rx}{n}} + (b-a)\left(\frac{1}{3} + \frac{1}{\alpha}\right)\frac{x}{n}.$$
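Let $\mathcal{F}(x,y) := \{f \in \mathcal{F} : x \le T(f) \le y\}$ and define $k$ to be the smallest
integer such that $r\lambda^{k+1} \ge Bb$. Then

$$\begin{aligned}
\mathbb{E}R_n\mathcal{G}_r &\le \mathbb{E}R_n\mathcal{F}(0,r) + \mathbb{E}\sup_{f \in \mathcal{F}(r,Bb)} \frac{r}{w(f)} R_n f \\
&\le \mathbb{E}R_n\mathcal{F}(0,r) + \sum_{j=0}^{k} \mathbb{E}\sup_{f \in \mathcal{F}(r\lambda^j, r\lambda^{j+1})} \frac{r}{w(f)} R_n f \\
&= \mathbb{E}R_n\mathcal{F}(0,r) + \sum_{j=0}^{k} \lambda^{-j}\, \mathbb{E}\sup_{f \in \mathcal{F}(r\lambda^j, r\lambda^{j+1})} R_n f \\
&\le \frac{\psi(r)}{B} + \frac{1}{B}\sum_{j=0}^{k} \lambda^{-j}\psi(r\lambda^{j+1}).
\end{aligned} \tag{3.1}$$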

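By our assumption it follows that for $\beta \ge 1$, $\psi(\beta r) \le \sqrt{\beta}\,\psi(r)$. Hence,

$$\mathbb{E}R_n\mathcal{G}_r \le \frac{1}{B}\,\psi(r)\left(1 + \sqrt{\lambda}\sum_{j=0}^{k} \lambda^{-j/2}\right),$$

and taking $\lambda = 4$, the right-hand side is upper bounded by $5\psi(r)/B$.
Moreover, for $r \ge r^*$, $\psi(r) \le \sqrt{r/r^*}\,\psi(r^*) = \sqrt{r r^*}$, and thus

$$V_r^+ \le \frac{10(1+\alpha)}{B}\sqrt{r r^*} + \sqrt{\frac{2rx}{n}} + (b-a)\left(\frac{1}{3} + \frac{1}{\alpha}\right)\frac{x}{n}.$$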

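Set $A = 10(1+\alpha)\sqrt{r^*}/B + \sqrt{2x/n}$ and $C = (b-a)(1/3 + 1/\alpha)x/n$, and note
that $V_r^+ \le A\sqrt{r} + C$.

We now show that $r$ can be chosen such that $V_r^+ \le r/(\lambda BK)$. Indeed,
consider the largest solution $r_0$ of $A\sqrt{r} + C = r/(\lambda BK)$. It satisfies $r_0 \ge
\lambda^2 A^2 B^2 K^2/2 \ge r^*$ and $r_0 \le (\lambda BK)^2 A^2 + 2\lambda BKC$, so that applying Lemma 3.8,
it follows that every $f \in \mathcal{F}$ satisfies

$$\begin{aligned}
Pf &\le \frac{K}{K-1} P_n f + \lambda BK A^2 + 2C \\
&= \frac{K}{K-1} P_n f + \lambda BK\left(100(1+\alpha)^2\frac{r^*}{B^2} + \frac{20(1+\alpha)}{B}\sqrt{\frac{2xr^*}{n}} + \frac{2x}{n}\right) + (b-a)\left(\frac{1}{3} + \frac{1}{\alpha}\right)\frac{x}{n}.
\end{aligned}$$

Setting $\alpha = 1/10$ and using Lemma A.3 to show that $\sqrt{2xr^*/n} \le Bx/(5n) +
5r^*/(2B)$ completes the proof of the first statement. The second statement
is proved in the same way, by considering $V_r^-$ instead of $V_r^+$. □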
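The bounds on $r_0$ used above can be checked numerically. The sketch below is only an illustration under assumed values: it writes $D$ for the product $\lambda BK$, substitutes $s = \sqrt{r}$ to turn $A\sqrt{r} + C = r/D$ into the quadratic $s^2 - DAs - DC = 0$, and compares the largest root with $D^2A^2/2$ and $D^2A^2 + 2DC$. The numbers chosen for `A`, `C` and `D` are arbitrary and the name `largest_root` is hypothetical.

```python
import math


def largest_root(A, C, D):
    """Largest r >= 0 solving A*sqrt(r) + C = r / D, where D stands for lambda*B*K."""
    # With s = sqrt(r) the equation reads s^2 - D*A*s - D*C = 0.
    s = (D * A + math.sqrt((D * A) ** 2 + 4.0 * D * C)) / 2.0
    return s * s


# Arbitrary illustrative values (assumptions, not taken from the paper).
A, C, D = 0.3, 0.05, 8.0
r0 = largest_root(A, C, D)
lower = (D * A) ** 2 / 2.0            # lambda^2 A^2 B^2 K^2 / 2
upper = (D * A) ** 2 + 2.0 * D * C    # (lambda B K)^2 A^2 + 2 lambda B K C
```

The upper bound follows from $2ab \le a^2 + b^2$ applied to the cross term in $s^2$, which is why it holds for every nonnegative choice of $A$, $C$ and $D$.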
Proof of Theorem 3.3, second part. The proof of this result uses
the same argument as for the first part. However, we consider the class $\tilde{\mathcal{G}}_r$
defined above. One can easily check that $\tilde{\mathcal{G}}_r \subseteq \{f \in \mathrm{star}(\mathcal{F}, 0) : T(f) \le r\}$,
and thus $\mathbb{E}R_n\tilde{\mathcal{G}}_r \le \psi(r)/B$. Applying Theorem 2.1 to $\tilde{\mathcal{G}}_r$, it follows that, for
all $x > 0$, with probability $1 - e^{-x}$,

$$\tilde V_r^+ \le \frac{2(1+\alpha)}{B}\,\psi(r) + \sqrt{\frac{2rx}{n}} + (b-a)\left(\frac{1}{3} + \frac{1}{\alpha}\right)\frac{x}{n}.$$

The reasoning is then the same as for the first part, and we use in the
very last step that $\sqrt{2xr^*/n} \le Bx/n + r^*/(2B)$, which gives the displayed
constants. □

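Proof of Lemma 3.6. The map $\alpha \mapsto \alpha^2$ is Lipschitz with constant 2
when $\alpha$ is restricted to $[-1,1]$. Applying Theorem A.6,

$$r \ge 10\,\mathbb{E}R_n\{f^2 : f \in \mathrm{star}(\mathcal{F}, 0),\ Pf^2 \le r\} + \frac{26x}{n}. \tag{3.2}$$

Clearly, if $f \in \mathcal{F}$, then $f^2$ maps to $[0,1]$ and $\mathrm{Var}[f^2] \le Pf^2$. Thus,
Theorem 2.1 can be applied to the class $\mathcal{G}_r = \{rf^2/(Pf^2 \vee r) : f \in \mathcal{F}\}$, whose
functions have range in $[0,1]$ and variance bounded by $r$. Therefore, with
probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$ satisfies

$$r\,\frac{Pf^2 - P_n f^2}{Pf^2 \vee r} \le 2(1+\alpha)\,\mathbb{E}R_n\mathcal{G}_r + \sqrt{\frac{2rx}{n}} + \left(\frac{1}{3} + \frac{1}{\alpha}\right)\frac{x}{n}.$$

Select $\alpha = 1/4$ and notice that $\sqrt{2rx/n} \le r/4 + 2x/n$ to get

$$r\,\frac{Pf^2 - P_n f^2}{Pf^2 \vee r} \le \frac{5}{2}\,\mathbb{E}R_n\mathcal{G}_r + \frac{r}{4} + \frac{19x}{3n}.$$

Hence, one either has $Pf^2 \le r$, or when $Pf^2 \ge r$, since it was assumed that
$P_n f^2 \le r$,

$$Pf^2 \le r + \frac{Pf^2}{r}\left(\frac{5}{2}\,\mathbb{E}R_n\mathcal{G}_r + \frac{r}{4} + \frac{19x}{3n}\right).$$

Now, if $g \in \mathcal{G}_r$, there exists $f_0 \in \mathcal{F}$ such that $g = rf_0^2/(Pf_0^2 \vee r)$. If $Pf_0^2 \le r$,
then $g = f_0^2$. On the other hand, if $Pf_0^2 > r$, then $g = rf_0^2/Pf_0^2 = f_1^2$ with
$f_1 \in \mathrm{star}(\mathcal{F}, 0)$ and $Pf_1^2 \le r$, which shows that

$$\mathbb{E}R_n\mathcal{G}_r \le \mathbb{E}R_n\{f^2 : f \in \mathrm{star}(\mathcal{F}, 0),\ Pf^2 \le r\}.$$

Thus, by (3.2), $Pf^2 \le 2r$, which concludes the proof. □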
4. Data-dependent error bounds. The results presented thus far use
distribution-dependent measures of complexity of the class at hand. Indeed,
the sub-root function $\psi$ of Theorem 3.3 is bounded in terms of the
Rademacher averages of the star-hull of $\mathcal{F}$, but these averages can only be
computed if one knows the distribution $P$. Otherwise, we have seen that it
is possible to compute an upper bound on the Rademacher averages using a
priori global or distribution-free knowledge about the complexity of the class
at hand (such as the VC-dimension). In this section we present error bounds
that can be computed directly from the data, without a priori information.
Instead of computing $\psi$, we compute an estimate, $\hat\psi_n$, of it. The function $\hat\psi_n$
is defined using the data and is an upper bound on $\psi$ with high probability.

To simplify the exposition we restrict ourselves to the case where the
functions have a range which is symmetric around zero, say $[-1,1]$. Moreover,
we can only treat the special case where $T(f) = Pf^2$, but this is a minor
restriction as in most applications this is the function of interest [i.e., for
which one can show $T(f) \le BPf$].

4.1. Results. We now present the main result of this section, which gives 
an analogue of the second part of Theorem 3.3, with a completely empirical 
bound (i.e., the bound can be computed from the data only). 

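Theorem 4.1. Let $\mathcal{F}$ be a class of functions with ranges in $[-1,1]$ and
assume that there is some constant $B$ such that for every $f \in \mathcal{F}$, $Pf^2 \le BPf$.
Let $\hat\psi_n$ be a sub-root function and let $\hat r^*$ be the fixed point of $\hat\psi_n$. Fix $x > 0$
and assume that $\hat\psi_n$ satisfies, for any $r \ge \hat r^*$,

$$\hat\psi_n(r) \ge c_1\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r\} + \frac{c_2 x}{n},$$

where $c_1 = 2(10 \vee B)$ and $c_2 = c_1 + 11$. Then, for any $K > 1$ with probability
at least $1 - 3e^{-x}$,

$$\forall f \in \mathcal{F} \qquad Pf \le \frac{K}{K-1} P_n f + \frac{6K}{B}\,\hat r^* + \frac{x(11 + 5BK)}{n}.$$

Also, with probability at least $1 - 3e^{-x}$,

$$\forall f \in \mathcal{F} \qquad P_n f \le \frac{K+1}{K} Pf + \frac{6K}{B}\,\hat r^* + \frac{x(11 + 5BK)}{n}.$$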

Although these are data-dependent bounds, they are not necessarily easy 
to compute. There are, however, favorable interesting situations where they 
can be computed efficiently, as Section 6 shows. 

It is natural to wonder how close the quantity $\hat r^*$ appearing in the above
theorem is to the quantity $r^*$ of Theorem 3.3. The next theorem shows that
they are close with high probability.

Theorem 4.2. Let $\mathcal{F}$ be a class of functions with ranges in $[-1,1]$. Fix
$x > 0$ and consider the sub-root functions

$$\psi(r) = \mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\}$$

and

$$\hat\psi_n(r) = c_1\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r\} + \frac{c_2 x}{n},$$

with fixed points $r^*$ and $\hat r^*$, respectively, and with $c_1 = 2(10 \vee B)$ and $c_2 = 13$.
Assume that $r^* \ge c_3 x/n$, where $c_3 = 26 \vee (c_2 + 2c_1)/3$. Then, with probability
at least $1 - 4e^{-x}$,

$$r^* \le \hat r^* \le 9(1 + c_1)^2 r^*.$$

Thus, with high probability, $\hat r^*$ is an upper bound on $r^*$ and has the
same asymptotic behavior. Notice that there was no attempt to optimize
the constants in the above theorem. In addition, the constant $9(1 + c_1)^2$
(equal to 3969 if $B \le 10$) in Theorem 4.2 does not appear in the upper
bound of Theorem 4.1.

4.2. Proofs. The idea of the proofs is to show that one can upper bound
$\psi$ by an empirical estimate (with high probability). This requires two steps:
the first one uses the concentration of the Rademacher averages to upper
bound the expected Rademacher averages by their empirical versions. The
second step uses Corollary 2.2 to prove that the ball over which the averages
are computed [which is an $L_2(P)$ ball] can be replaced by an empirical one.
Thus, $\hat\psi_n$ is an upper bound on $\psi$, and one can apply Theorem 3.3, together
with the following lemma, which shows how fixed points of sub-root functions
relate when the functions are ordered.

Lemma 4.3. Suppose that $\psi, \hat\psi_n$ are sub-root. Let $r^*$ (resp. $\hat r^*$) be the
fixed point of $\psi$ (resp. $\hat\psi_n$). If for $0 \le \alpha \le 1$ we have $\alpha\hat\psi_n(\hat r^*) \le \psi(\hat r^*) \le
\hat\psi_n(\hat r^*)$, then

$$\alpha^2 \hat r^* \le r^* \le \hat r^*.$$

Proof. Denoting by $\hat r^*_\alpha$ the fixed point of the sub-root function $\alpha\hat\psi_n$,
then, by Lemma 3.2, $\hat r^*_\alpha \le r^* \le \hat r^*$. Also, since $\hat\psi_n$ is sub-root, $\hat\psi_n(\alpha^2\hat r^*) \ge
\alpha\hat\psi_n(\hat r^*) = \alpha\hat r^*$, which means $\alpha\hat\psi_n(\alpha^2\hat r^*) \ge \alpha^2\hat r^*$. Hence, Lemma 3.2 yields
$\hat r^*_\alpha \ge \alpha^2\hat r^*$. □

Proof of Theorem 4.1. Consider the sub-root function
$$\psi_1(r) = \frac{c_1}{2}\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} + \frac{(c_2 - c_1)x}{n},$$



20 P. L. BARTLETT, O. BOUSQUET AND S. MENDELSON 

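with fixed point $r_1^*$. Applying Corollary 2.2 when $r \ge \psi_1(r)$, it follows that
with probability at least $1 - e^{-x}$,

$$\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} \subseteq \{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r\}.$$

Using this together with the first inequality of Lemma A.4 (with $\alpha = 1/2$)
shows that if $r \ge \psi_1(r)$, with probability at least $1 - 2e^{-x}$,

$$\begin{aligned}
\psi_1(r) &= \frac{c_1}{2}\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} + \frac{(c_2 - c_1)x}{n} \\
&\le c_1\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} + \frac{c_2 x}{n} \\
&\le c_1\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r\} + \frac{c_2 x}{n} \\
&\le \hat\psi_n(r).
\end{aligned}$$

Choosing $r = r_1^*$, Lemma 4.3 shows that with probability at least $1 - 2e^{-x}$,

$$r_1^* \le \hat r^*. \tag{4.1}$$

Also, for all $r \ge 0$,

$$\psi_1(r) \ge B\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\},$$

and so from Theorem 3.3, with probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$
satisfies

$$Pf \le \frac{K}{K-1} P_n f + \frac{6K}{B}\,r_1^* + \frac{(11 + 5BK)x}{n}.$$

Combining this with (4.1) gives the first result. The second result is proved
in a similar manner. □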

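Proof of Theorem 4.2. Consider the functions

$$\psi_1(r) = \frac{c_1}{2}\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} + \frac{(c_2 - c_1)x}{n}$$

and

$$\psi_2(r) = c_1\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} + \frac{c_3 x}{n},$$

and denote by $r_1^*$ and $r_2^*$ the fixed points of $\psi_1$ and $\psi_2$, respectively. The
proof of Theorem 4.1 shows that with probability at least $1 - 2e^{-x}$, $r_1^* \le \hat r^*$.

Now apply Lemma 3.6 to show that if $r \ge \psi_2(r)$, then with probability at
least $1 - e^{-x}$,

$$\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le r\} \subseteq \{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le 2r\}.$$

Using this together with the second inequality of Lemma A.4 (with $\alpha = 1/2$)
shows that if $r \ge \psi_2(r)$, with probability at least $1 - 2e^{-x}$,

$$\begin{aligned}
\hat\psi_n(r) &= c_1\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r\} + \frac{c_2 x}{n} \\
&\le c_1\sqrt{2}\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le r\} + \frac{c_2 x}{n} \\
&\le c_1\sqrt{2}\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le 2r\} + \frac{c_2 x}{n} \\
&\le \frac{3}{\sqrt{2}}\,c_1\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le 2r\} + \frac{(c_2 + 2c_1)x}{n} \\
&\le 3c_1\,\mathbb{E}R_n\{f \in \mathrm{star}(\mathcal{F}, 0) : Pf^2 \le r\} + \frac{(c_2 + 2c_1)x}{n} \\
&\le 3\psi_2(r),
\end{aligned}$$

where the sub-root property was used twice (in the first and second to last
inequalities). Lemma 4.3 thus gives $\hat r^* \le 9r_2^*$.

Also notice that for all $r$, $\psi(r) \le \psi_1(r)$, and hence $r^* \le r_1^*$. Moreover, for
all $r \ge \psi(r)$ (hence $r \ge r^* \ge c_3 x/n$), $\psi_2(r) \le c_1\psi(r) + r$, so that $\psi_2(r^*) \le
(c_1 + 1)r^* = (c_1 + 1)\psi(r^*)$. Lemma 4.3 implies that $r_2^* \le (1 + c_1)^2 r^*$. □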

5. Prediction with bounded loss. In this section we discuss the
application of our results to prediction problems, such as classification and
regression. For such problems there are an input space $\mathcal{X}$ and an output space $\mathcal{Y}$,
and the product $\mathcal{X} \times \mathcal{Y}$ is endowed with an unknown probability measure
$P$. For example, classification corresponds to the case where $\mathcal{Y}$ is discrete,
typically $\mathcal{Y} = \{-1, 1\}$, and regression corresponds to the continuous case,
typically $\mathcal{Y} = [-1, 1]$. Note that assuming the boundedness of the target
values is a typical assumption in theoretical analysis of regression procedures.
To analyze the case of unbounded targets, one usually truncates the values
at a certain threshold and bounds the probability of exceeding that threshold
(see, e.g., the techniques developed in [12]).

The training sample is a sequence $(X_1, Y_1), \ldots, (X_n, Y_n)$ of $n$ independent
and identically distributed (i.i.d.) pairs sampled according to $P$. A loss
function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ is defined and the goal is to find a function $f : \mathcal{X} \to \mathcal{Y}$
from a class $\mathcal{F}$ that minimizes the expected loss

$$P\ell_f = \mathbb{E}\,\ell(f(X), Y).$$

Since the probability distribution $P$ is unknown, one cannot directly
minimize the expected loss over $\mathcal{F}$.
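Because $P$ is unknown, the quantity that can actually be computed is the empirical loss $P_n\ell_f = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$. The following minimal sketch illustrates empirical loss minimization over a small finite class; the threshold classifiers, the toy sample and all function names are hypothetical choices made for illustration and do not come from the paper.

```python
def empirical_loss(f, sample, loss):
    """P_n l_f: average loss of predictor f over the sample of (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)


def erm(candidates, sample, loss):
    """Return a candidate with the smallest empirical loss (first one on ties)."""
    return min(candidates, key=lambda f: empirical_loss(f, sample, loss))


def zero_one(pred, y):
    """Discrete loss 1[pred != y], taking values in [0, 1]."""
    return 0.0 if pred == y else 1.0


def threshold(t):
    """Hypothetical classifier x -> +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


# Toy sample with labels in {-1, +1} (illustrative values only).
sample = [(0.1, -1), (0.4, -1), (0.5, 1), (0.8, 1), (0.3, 1)]
candidates = [threshold(t) for t in (0.0, 0.25, 0.45, 0.9)]
best = erm(candidates, sample, zero_one)
best_loss = empirical_loss(best, sample, zero_one)
```

The results of this section quantify how far the expected loss of such an empirical minimizer can be from the best expected loss in the class.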

The key property that is needed to apply our results is the fact that
$\mathrm{Var}[f] \le BPf$ (or $Pf^2 \le BPf$ to obtain data-dependent bounds). This will
trivially be the case for the class $\{\ell_f : f \in \mathcal{F}\}$, as all its functions are
uniformly bounded and nonnegative. This case, studied in Section 5.1, is,
however, not the most interesting. Indeed, it is when one studies the excess risk
$\ell_f - \ell_{f^*}$ that our approach shows its superiority over previous ones; when
the class $\{\ell_f - \ell_{f^*}\}$ satisfies the variance condition (and Section 5.2 gives
examples of this), we obtain distribution-dependent bounds that are optimal
in certain cases, and data-dependent bounds of the same order.

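5.1. General results without assumptions. Define the following class of
functions, called the loss class associated with $\mathcal{F}$:

$$\ell_{\mathcal{F}} = \{\ell_f : f \in \mathcal{F}\} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}.$$

Notice that $\ell_{\mathcal{F}}$ is a class of nonnegative functions. Applying Theorem 4.1 to
this class of functions gives the following corollary.

Corollary 5.1. For a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,1]$, define

$$\hat\psi_n(r) = 20\,\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\ell_{\mathcal{F}}, 0) : P_n f^2 \le 2r\} + \frac{13x}{n},$$

with fixed point $\hat r^*$. Then, for any $K > 1$ with probability at least $1 - 3e^{-x}$,

$$\forall f \in \mathcal{F} \qquad P\ell_f \le \frac{K}{K-1} P_n\ell_f + 6K\hat r^* + \frac{x(11 + 5K)}{n}.$$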

A natural approach is to minimize the empirical loss $P_n\ell_f$ over the class
$\mathcal{F}$. The following result shows that this approach leads to an estimate with
expected loss near minimal. How close it is to the minimal expected loss
depends on the value of the minimum, as well as on the local Rademacher
averages of the class.

Theorem 5.2. For a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,1]$, define $\psi(r)$, $\hat\psi_n(r)$,
$r^*$ and $\hat r^*$ as in Theorem 5.1. Let $L^* = \inf_{f \in \mathcal{F}} P\ell_f$. Then there is a constant
$c$ such that with probability at least $1 - 2e^{-x}$, the minimizer $\hat f \in \mathcal{F}$ of $P_n\ell_f$
satisfies

$$P\ell_{\hat f} \le L^* + c\left(\sqrt{L^* r^*} + r^*\right).$$

Also, with probability at least $1 - 4e^{-x}$,

$$P\ell_{\hat f} \le L^* + c\left(\sqrt{L^* \hat r^*} + \hat r^*\right).$$

The proof of this theorem is given in Appendix A.2.
This theorem has the same flavor as Theorem 4.2 of [19]. We have not
used any property besides the positivity of the functions in the class. This
indicates that there might not be a significant gain compared to earlier
results (as without further assumptions the optimal rates are known). Indeed,
a careful examination of this result shows that when $L^* > 0$, the difference
between $P\ell_{\hat f}$ and $L^*$ is essentially of order $\sqrt{r^*}$. For a class of $\{0,1\}$-valued
functions with VC-dimension $d$, for example, this would be $\sqrt{d \log n / n}$. On
the other hand, the result of [19] is more refined since the Rademacher
averages are not localized around 0 (as they are here), but rather around the
minimizer of the empirical error itself. Unfortunately, the small ball in [19] is
not defined as $P_n\ell_f \le P_n\ell_{\hat f} + r$ but as $P_n\ell_f \le 16P_n\ell_{\hat f} + r$. This means that in
the general situation where $L^* > 0$, since $P_n\ell_{\hat f}$ does not converge to 0 with
increasing $n$ (as it is expected to be close to $P\ell_{\hat f}$ which itself converges to
$L^*$), the radius of the ball around $\ell_{\hat f}$ (which is $15P_n\ell_{\hat f} + r$) will not converge
to 0. Thus, the localized Rademacher average over this ball will converge
at speed $\sqrt{d/n}$. In other words, our Theorem 5.2 and Theorem 4.2 of [19]
essentially have the same behavior. But this is not surprising, as it is known
that this is the optimal rate of convergence in this case. To get an
improvement in the rates of convergence, one needs to make further assumptions on
the distribution $P$ or on the class $\mathcal{F}$.

5.2. Improved results for the excess risk. Consider a loss function $\ell$ and
function class $\mathcal{F}$ that satisfy the following conditions.

1. For every probability distribution $P$ there is an $f^* \in \mathcal{F}$ satisfying $P\ell_{f^*} =
\inf_{f \in \mathcal{F}} P\ell_f$.

2. There is a constant $L$ such that $\ell$ is $L$-Lipschitz in its first argument: for
all $y, y_1, y_2$,

$$|\ell(y_1, y) - \ell(y_2, y)| \le L|y_1 - y_2|.$$

3. There is a constant $B \ge 1$ such that for every probability distribution
and every $f \in \mathcal{F}$,

$$P(f - f^*)^2 \le BP(\ell_f - \ell_{f^*}).$$

These conditions are not too restrictive as they are met by several commonly
used regularized algorithms with convex losses.

Note that condition 1 could be weakened, and one could consider a
function which is only close to achieving the infimum, with an appropriate change
to condition 3. This generalization is straightforward, but it would make the
results less readable, so we omit it.

Condition 2 implies that, for all $f \in \mathcal{F}$,

$$P(\ell_f - \ell_{f^*})^2 \le L^2 P(f - f^*)^2.$$

Condition 3 usually follows from a uniform convexity condition on $\ell$. An
important example is the quadratic loss $\ell(y, y') = (y - y')^2$, when the function
class $\mathcal{F}$ is convex and uniformly bounded. In particular, if $|f(x) - y| \in [0, 1]$
for all $f \in \mathcal{F}$, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, then the conditions are satisfied with $L = 2$
and $B = 1$ (see [18]). Other examples are described in [26] and in [2].

The first result we present is a direct but instructive corollary of Theo- 
rem 3.3. 

Corollary 5.3. Let $\mathcal{F}$ be a class of functions with ranges in $[-1, 1]$ and
let $\ell$ be a loss function satisfying conditions 1-3 above. Let $\hat f$ be any element
of $\mathcal{F}$ satisfying $P_n\ell_{\hat f} = \inf_{f \in \mathcal{F}} P_n\ell_f$. Assume $\psi$ is a sub-root function for
which

$$\psi(r) \ge BL\,\mathbb{E}R_n\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\}.$$

Then for any $x > 0$ and any $r \ge \psi(r)$, with probability at least $1 - e^{-x}$,

$$P(\ell_{\hat f} - \ell_{f^*}) \le 705\,\frac{r}{B} + \frac{(11L + 27B)x}{n}.$$

Proof. One applies Theorem 3.3 (first part) to the class $\ell_f - \ell_{f^*}$ with
$T(f) = L^2 P(f - f^*)^2$ and uses the fact that by Theorem A.6, and by the
symmetry of the Rademacher variables, $L\,\mathbb{E}R_n\{f : L^2 P(f - f^*)^2 \le r\} \ge \mathbb{E}R_n\{\ell_f -
\ell_{f^*} : L^2 P(f - f^*)^2 \le r\}$. The result follows from noticing that $P_n(\ell_{\hat f} - \ell_{f^*}) \le
0$. □

Instead of comparing the loss of $\hat f$ to that of $f^*$, one could compare
it to the loss of the best measurable function (the regression function for
regression function estimation, or the Bayes classifier for classification). The
techniques proposed here can be adapted to this case.

Using Corollary 5.3, one can (with minor modification) recover the results 
of [22] for model selection. These have been shown to match the minimax 
results in various situations. In that sense. Corollary 5.3 can be considered 
as sharp. 

Next we turn to the main result of this section. It is a version of
Corollary 5.3 with a fully data-dependent bound. This is obtained by modifying
$\psi$ in three ways: the Rademacher averages are replaced by empirical ones,
the radius of the ball is in the $L_2(P_n)$ norm instead of $L_2(P)$, and finally,
the center of the ball is $\hat f$ instead of $f^*$.

Theorem 5.4. Let $\mathcal{F}$ be a convex class of functions with range in $[-1, 1]$
and let $\ell$ be a loss function satisfying conditions 1-3 above. Let $\hat f$ be any
element of $\mathcal{F}$ satisfying $P_n\ell_{\hat f} = \inf_{f \in \mathcal{F}} P_n\ell_f$. Define

$$\hat\psi_n(r) = c_1\,\mathbb{E}_\sigma R_n\{f \in \mathcal{F} : P_n(f - \hat f)^2 \le c_3 r\} + \frac{c_2 x}{n}, \tag{5.1}$$

where $c_1 = 2L(B \vee 10L)$, $c_2 = 11L^2 + c_1$ and $c_3 = 2824 + 4B(11L + 27B)/c_2$.
Then with probability at least $1 - 4e^{-x}$,

$$P(\ell_{\hat f} - \ell_{f^*}) \le \frac{705}{B}\,\hat r^* + \frac{(11L + 27B)x}{n},$$

where $\hat r^*$ is the fixed point of $\hat\psi_n$.



Remark 5.5. Unlike Corollary 5.3, the class $\mathcal{F}$ in Theorem 5.4 has to
be convex. This ensures that it is star-shaped around any of its elements
(which implies that $\hat\psi_n$ is sub-root even though $\hat f$ is random). However,
convexity of the loss class is not necessary, so that this theorem still applies
to many situations of interest, in particular to regularized regression, where
the functions are taken in a vector space or a ball of a vector space.

Remark 5.6. Although the theorem is stated with explicit constants, 
there is no reason to think that these are optimal. The fact that the constant 
705 appears actually is due to our failure to apply the second part of The- 
orem 3.3 to the initial loss class, which is not star-shaped (this would have 
given a 7 instead). However, with some additional effort, one can probably 
obtain much better constants. 



As we explained earlier, although the statement of Theorem 5.4 is similar 
to Theorem 4.2 in [19], there is an important difference in the way the local- 
ized averages are defined: in our case the radius is a constant times r, while 
in [19] there is an additional term, involving the loss of the empirical risk 
minimizer, which may not converge to zero. Hence, the complexity decreases 
faster in our bound. 

The additional property required in the proof of this result compared to
the proof of Theorem 4.1 is that under the assumptions of the theorem, the
minimizers of the empirical loss and of the true loss are close with respect
to the $L_2(P)$ and the $L_2(P_n)$ distances (this has also been used in [20]
and [31, 32]).

Proof of Theorem 5.4. Define the function $\psi$ as

$$\psi(r) = \frac{c_1}{2}\,\mathbb{E}R_n\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\} + \frac{(c_2 - c_1)x}{n}. \tag{5.2}$$

Notice that since $\mathcal{F}$ is convex and thus star-shaped around each of its points,
Lemma 3.4 implies that $\psi$ is sub-root. Now, for $r \ge \psi(r)$ Corollary 5.3 and
condition 3 on the loss function imply that, with probability at least $1 - e^{-x}$,

$$L^2 P(\hat f - f^*)^2 \le BL^2 P(\ell_{\hat f} - \ell_{f^*}) \le 705L^2 r + \frac{(11L + 27B)BL^2 x}{n}. \tag{5.3}$$




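Denote the right-hand side by $s$. Since $s \ge r \ge r^*$, then $s \ge \psi(s)$ (by Lemma 3.2),
and thus

$$s \ge 10L^2\,\mathbb{E}R_n\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le s\} + \frac{11L^2 x}{n}.$$

Therefore, Corollary 2.2 applied to the class $L\mathcal{F}$ yields that with probability
at least $1 - e^{-x}$,

$$\{f \in \mathcal{F},\ L^2 P(f - f^*)^2 \le s\} \subseteq \{f \in \mathcal{F},\ L^2 P_n(f - f^*)^2 \le 2s\}.$$

This, combined with (5.3), implies that with probability at least $1 - 2e^{-x}$,

$$P_n(\hat f - f^*)^2 \le 2\left(705r + \frac{(11L + 27B)Bx}{n}\right) \le 2\left(705 + \frac{(11L + 27B)B}{c_2}\right) r, \tag{5.4}$$

where the second inequality follows from $r \ge \psi(r) \ge c_2 x/n$. Define $c = 2(705 +
(11L + 27B)B/c_2)$. By the triangle inequality in $L_2(P_n)$, if (5.4) occurs, then
any $f \in \mathcal{F}$ satisfies

$$P_n(f - \hat f)^2 \le \left(\sqrt{P_n(f - f^*)^2} + \sqrt{P_n(f^* - \hat f)^2}\right)^2 \le \left(\sqrt{P_n(f - f^*)^2} + \sqrt{cr}\right)^2.$$

Appealing again to Corollary 2.2 applied to $L\mathcal{F}$ as before, but now for
$r \ge \psi(r)$, it follows that with probability at least $1 - 3e^{-x}$,

$$\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\} \subseteq \{f \in \mathcal{F} : L^2 P_n(f - \hat f)^2 \le (\sqrt{2} + \sqrt{c})^2 L^2 r\}.$$

Combining this with Lemma A.4 shows that, with probability at least
$1 - 4e^{-x}$,

$$\begin{aligned}
\psi(r) &\le c_1\,\mathbb{E}_\sigma R_n\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\} + \frac{c_2 x}{n} \\
&\le c_1\,\mathbb{E}_\sigma R_n\{f \in \mathcal{F} : P_n(f - \hat f)^2 \le (\sqrt{2} + \sqrt{c})^2 r\} + \frac{c_2 x}{n} \\
&\le c_1\,\mathbb{E}_\sigma R_n\{f \in \mathcal{F} : P_n(f - \hat f)^2 \le (4 + 2c) r\} + \frac{c_2 x}{n}.
\end{aligned}$$

Since $4 + 2c = c_3$, the last line equals $\hat\psi_n(r)$.
Setting $r = r^*$ in the above argument and applying Lemma 4.3 shows that
$r^* \le \hat r^*$, which, together with (5.3), concludes the proof. □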
6. Computing local Rademacher complexities. In this section we deal
with the computation of local Rademacher complexities and their fixed 
points. We first propose a simple iterative procedure for estimating the 
fixed point of an arbitrary sub-root function and then give two examples 
of situations where it is possible to compute an upper bound on the local 
Rademacher complexities. In the case of classification with the discrete loss, 
this can be done by solving a weighted error minimization problem. In the 
case of kernel classes, it is obtained by computing the eigenvalues of the 
empirical Gram matrix. 

6.1. The iterative procedure. Recall that Theorem 4.1 indicates that one 
can obtain an upper bound in terms of empirical quantities only. However, it 
remains to be explained how to compute these quantities effectively. We pro- 
pose to use a procedure similar to that of Koltchinskii and Panchenko [16], 
by applying the sub-root function iteratively. The next lemma shows that 
applying the sub-root function iteratively gives a sequence that converges 
monotonically and quickly to the fixed point. 

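Lemma 6.1. Let $\psi : [0, \infty) \to [0, \infty)$ be a (nontrivial) sub-root function.
Fix $r_0 \ge r^*$, and for all $k \ge 0$ define $r_{k+1} = \psi(r_k)$. Then for all $N > 0$,
$r_{N+1} \le r_N$, and

$$r^* \le r_N \le \left(\frac{r_0}{r^*}\right)^{2^{-N}} r^*.$$

In particular, for any $\varepsilon > 0$, if $N$ satisfies

$$N \ge \log_2\left(\frac{\ln(r_0/r^*)}{\ln(1 + \varepsilon)}\right),$$

then $r_N \le (1 + \varepsilon)r^*$.

Proof. Notice that if $r_k \ge r^*$, then $r_{k+1} = \psi(r_k) \ge \psi(r^*) = r^*$. Also,

$$\frac{\psi(r_k)}{\sqrt{r_k}} \le \frac{\psi(r^*)}{\sqrt{r^*}} = \sqrt{r^*} \le \sqrt{r_k},$$

and so $r_{k+1} \le r_k$ and $r_{k+1}/r^* \le (r_k/r^*)^{1/2}$. An easy induction shows that
$r_N/r^* \le (r_0/r^*)^{2^{-N}}$. □

Notice that in the results of [16], the analysis of the iterative procedure
was tied to the probabilistic upper bounds. However, here we make the issues
separate: the bounds of previous sections are valid no matter how the fixed
point is estimated. In the above lemma, one can use a random sub-root
function.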
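The iteration $r_{k+1} = \psi(r_k)$ of Lemma 6.1 is short enough to sketch directly. In the example below, the particular function `psi` is a hypothetical sub-root function chosen only because its fixed point has a closed form; it is not one of the complexity functions of the paper, and the names `iterate_fixed_point`, `r_star` and `seq` are illustrative.

```python
import math


def iterate_fixed_point(psi, r0, num_steps):
    """Return [r_0, r_1, ..., r_N] with r_{k+1} = psi(r_k), as in Lemma 6.1."""
    seq = [r0]
    for _ in range(num_steps):
        seq.append(psi(seq[-1]))
    return seq


def psi(r):
    # Hypothetical sub-root function: nonnegative, nondecreasing, and
    # psi(r) / sqrt(r) = 0.5 + 0.01 / sqrt(r) is nonincreasing for r > 0.
    return 0.5 * math.sqrt(r) + 0.01


# Closed-form fixed point for this psi: s = sqrt(r) solves s^2 - 0.5 s - 0.01 = 0.
s_star = (0.5 + math.sqrt(0.25 + 0.04)) / 2.0
r_star = s_star ** 2

# Start from r_0 = 1 >= r_star and iterate six times.
seq = iterate_fixed_point(psi, 1.0, 6)
```

Each iterate can then be compared with the guarantee $r^* \le r_N \le (r_0/r^*)^{2^{-N}} r^*$; the doubly exponential decay of the exponent is why a handful of iterations usually suffices.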
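6.2. Local Rademacher complexities for classification loss classes.
Consider the case where $\mathcal{Y} = \{-1, 1\}$ and the loss is the discrete loss, $\ell(y, y') =
1[y \ne y']$. Since $\ell^2 = \ell$, one can write

$$\begin{aligned}
\mathbb{E}_\sigma R_n\{f \in \mathrm{star}(\ell_{\mathcal{F}}, 0) : P_n f^2 \le 2r\}
&= \mathbb{E}_\sigma R_n\{\alpha\ell_f : \alpha \in (0, 1],\ f \in \mathcal{F},\ P_n\ell_f^2 \le 2r/\alpha^2\} \\
&= \mathbb{E}_\sigma R_n\{\alpha\ell_f : \alpha \in (0, 1],\ f \in \mathcal{F},\ P_n\ell_f \le 2r/\alpha^2\} \\
&= \sup_{\alpha \in (0,1]} \alpha\,\mathbb{E}_\sigma R_n\{\ell_f : f \in \mathcal{F},\ P_n\ell_f \le 2r/\alpha^2\} \\
&= \sup_{\alpha \in [\sqrt{r},1]} \alpha\,\mathbb{E}_\sigma R_n\{\ell_f : f \in \mathcal{F},\ P_n\ell_f \le 2r/\alpha^2\},
\end{aligned}$$

where the last equality follows from the fact that $P_n\ell_f \le 1$ for all $f$.
Substituting into Corollary 5.1 gives the following result.

Corollary 6.2. Let $\mathcal{Y} = \{\pm 1\}$, let $\ell$ be the discrete loss defined on $\mathcal{Y}$
and let $\mathcal{F}$ be a class of functions with ranges in $\mathcal{Y}$. Fix $x > 0$ and define

$$\hat\psi_n(r) = 20 \sup_{\alpha \in [\sqrt{r},1]} \alpha\,\mathbb{E}_\sigma R_n\{\ell_f : f \in \mathcal{F},\ P_n\ell_f \le 2r/\alpha^2\} + \frac{26x}{n}.$$

Then for all $K > 1$, with probability at least $1 - 3e^{-x}$, for all $f \in \mathcal{F}$,

where $\hat r^*$ is the fixed point of $\hat\psi_n$.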

The following theorem shows that upper bounds on $\hat\psi_n(r)$ can be
computed whenever one can perform weighted empirical risk minimization. In
other words, if there is an efficient algorithm for minimizing a weighted sum
of classification errors, there is an efficient algorithm for computing an upper
bound on the localized Rademacher averages. The empirical minimization
algorithm needs to be run repeatedly on different realizations of the $\sigma_i$, but
with fast convergence toward the expectation as the number of iterations
grows. A similar result was known for global Rademacher averages and this
shows that the localization and the use of star-hulls do not greatly affect the
computational complexity.

Theorem 6.3. The empirical local Rademacher complexity of the
classification loss class, defined in Corollary 6.2, satisfies

$$\begin{aligned}
\hat\psi_n(r) &= c \sup_{\alpha \in [\sqrt{r},1]} \alpha\,\mathbb{E}_\sigma R_n\{\ell_f : f \in \mathcal{F},\ P_n\ell_f \le 2r/\alpha^2\} + \frac{26x}{n} \\
&\le c \sup_{\alpha \in [\sqrt{r},1]} \alpha\,\mathbb{E}_\sigma \min_{\mu \ge 0}\left(\left(\frac{2r}{\alpha^2} - \frac{1}{2}\right)\mu + \frac{1}{2n}\sum_{i=1}^{n} |\sigma_i + \mu Y_i| - J(\mu)\right) + \frac{26x}{n},
\end{aligned}$$

where

$$J(\mu) = \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |\sigma_i + \mu Y_i|\,\ell(f(X_i), \mathrm{sign}(\sigma_i + \mu Y_i)).$$

The quantity $J(\mu)$ can be viewed as the minimum of a certain weighted
empirical risk when the labels are corrupted by noise and the noise level
is determined by the parameter (Lagrange multiplier) $\mu$. Using the fact
that $J(\mu)$ is Lipschitz in $\mu$, a finite grid of values of $J(\mu)$ can be used
to obtain a function $\phi$ that is an upper bound on $\hat\psi_n$. Then the function
$r \mapsto \sqrt{r}\,\sup_{r' \ge r} \phi(r')/\sqrt{r'}$ is a sub-root upper bound on $\hat\psi_n$.

In order to prove Theorem 6.3 we need the following lemma (adapted
from [1]) which relates the localized Rademacher averages to a weighted
error minimization problem.

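Lemma 6.4. For every $b \in [0,1]$,

$$\mathbb{E}_\sigma R_n\{\ell_f : f \in \mathcal{F},\ P_n\ell(f(X), Y) \le b\} = \frac{1}{2} - \mathbb{E}_\sigma \min\{P_n\ell(f(X), \sigma) : f \in \mathcal{F},\ P_n\ell(f(X), Y) \le b\}.$$

Proof. Notice that for $y, y' \in \{\pm 1\}$, $\ell(y, y') = 1[y \ne y'] = |y - y'|/2$.
Thus

$$\begin{aligned}
2\sum_{i=1}^{n} \sigma_i\,\ell(f(X_i), Y_i) &= \sum_{i : Y_i = 1} \sigma_i|f(X_i) - 1| + \sum_{i : Y_i = -1} \sigma_i|f(X_i) + 1| \\
&= \sum_{i : Y_i = 1} \sigma_i(2 - |f(X_i) + 1|) + \sum_{i : Y_i = -1} \sigma_i|f(X_i) + 1| \\
&= \sum_{i=1}^{n} -Y_i\sigma_i|f(X_i) + 1| + 2\sum_{i : Y_i = 1} \sigma_i.
\end{aligned}$$

Because of the symmetry of $\sigma_i$, for fixed $X_i$ the vector $(-Y_i\sigma_i)_{i=1}^{n}$ has the
same distribution as $(\sigma_i)_{i=1}^{n}$. Thus when we take the expectation, we can
replace $-Y_i\sigma_i$ by $\sigma_i$. Moreover, we have

$$\begin{aligned}
\sum_{i=1}^{n} \sigma_i|f(X_i) + 1| &= \sum_{i : \sigma_i = 1} |f(X_i) + 1| + \sum_{i : \sigma_i = -1} -|f(X_i) + 1| \\
&= \sum_{i : \sigma_i = 1} (2 - |f(X_i) - 1|) + \sum_{i : \sigma_i = -1} -|f(X_i) + 1| \\
&= \sum_{i=1}^{n} -|f(X_i) - \sigma_i| + 2\sum_{i : \sigma_i = 1} 1,
\end{aligned}$$

implying that

$$\mathbb{E}_\sigma R_n\{\ell_f : f \in \mathcal{F},\ P_n\ell(f(X), Y) \le b\} = \frac{1}{n}\left(\mathbb{E}_\sigma\sum_{i : Y_i = 1} \sigma_i + \mathbb{E}_\sigma\sum_{i : \sigma_i = 1} 1\right) + \mathbb{E}_\sigma \sup\{-P_n\ell(f(X), \sigma) : f \in \mathcal{F},\ P_n\ell(f(X), Y) \le b\},$$

which proves the claim. □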
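For very small samples, the identity of Lemma 6.4 can be verified exactly by enumerating all $2^n$ sign vectors $\sigma$. The sketch below does this for a finite class described by its prediction vectors $(f(X_1), \ldots, f(X_n))$; the function name `check_lemma_6_4`, the toy predictions and the toy labels are assumptions made for illustration, and the enumeration is only practical for small $n$.

```python
from itertools import product


def zero_one(a, b):
    """Discrete loss 1[a != b] for a, b in {-1, +1}."""
    return 0.0 if a == b else 1.0


def check_lemma_6_4(preds, labels, b):
    """Return (lhs, rhs) of Lemma 6.4, averaging exactly over all sigma.

    preds: one prediction vector in {-1, +1}^n per function f in the class.
    labels: the vector (Y_1, ..., Y_n) in {-1, +1}^n.
    b: the bound on the empirical error P_n l(f(X), Y) defining the subclass.
    """
    n = len(labels)
    feasible = [p for p in preds
                if sum(zero_one(pi, yi) for pi, yi in zip(p, labels)) / n <= b]
    if not feasible:
        raise ValueError("no function satisfies the empirical error constraint")
    sup_total = 0.0
    min_total = 0.0
    for sigma in product((-1, 1), repeat=n):
        # Rademacher average of the loss class for this sigma.
        sup_total += max(
            sum(s * zero_one(pi, yi) for s, pi, yi in zip(sigma, p, labels)) / n
            for p in feasible)
        # Minimal empirical error when sigma is used as the label vector.
        min_total += min(
            sum(zero_one(pi, s) for pi, s in zip(p, sigma)) / n
            for p in feasible)
    count = 2 ** n
    return sup_total / count, 0.5 - min_total / count


# Toy class of four prediction vectors on n = 4 points (illustrative values).
preds = [(1, 1, 1, 1), (-1, 1, 1, 1), (-1, -1, 1, 1), (-1, -1, -1, -1)]
labels = (1, -1, 1, 1)
lhs, rhs = check_lemma_6_4(preds, labels, 0.5)
```

In practice the expectation over $\sigma$ would be approximated by sampling rather than enumerated, and the inner minimization is exactly the (weighted) empirical risk minimization problem that Theorem 6.3 relies on.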

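Proof of Theorem 6.3. From Lemma 6.4,

$$\hat\psi_n(r) = c \sup_{\alpha \in [\sqrt{r},1]} \alpha\left(\frac{1}{2} - \mathbb{E}_\sigma \min\left\{P_n\ell(f(X), \sigma) : f \in \mathcal{F},\ P_n\ell(f(X), Y) \le \frac{2r}{\alpha^2}\right\}\right) + \frac{26x}{n}.$$

Fix a realization of the $\sigma_i$. It is easy to see that when $\mu \ge 0$, each $f$ for which
$P_n\ell(f(X), Y) \le 2r/\alpha^2$ satisfies

$$P_n\ell(f(X), \sigma) \ge P_n\ell(f(X), \sigma) + \mu\left(P_n\ell(f(X), Y) - \frac{2r}{\alpha^2}\right).$$

Let $L(f, \mu)$ denote the right-hand side and let $g(\mu) = \min_{f \in \mathcal{F}} L(f, \mu)$. Then

$$\min\{P_n\ell(f(X), \sigma) : f \in \mathcal{F},\ P_n\ell(f(X), Y) \le 2r/\alpha^2\} \ge g(\mu).$$

But, using the fact that $\ell(y, \hat y) = (1 - y\hat y)/2$,

$$\begin{aligned}
g(\mu) &= \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \big(\ell(f(X_i), \sigma_i) + \mu\,\ell(f(X_i), Y_i)\big) - \frac{2r\mu}{\alpha^2} \\
&= \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \left(\frac{1 - f(X_i)\sigma_i}{2} + \mu\,\frac{1 - f(X_i)Y_i}{2}\right) - \frac{2r\mu}{\alpha^2} \\
&= \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \left(|\sigma_i + \mu Y_i|\,\frac{1 - f(X_i)\,\mathrm{sign}(\sigma_i + \mu Y_i)}{2} - \frac{|\sigma_i + \mu Y_i|}{2}\right) + \frac{1 + \mu}{2} - \frac{2r\mu}{\alpha^2} \\
&= \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |\sigma_i + \mu Y_i|\,\ell(f(X_i), \mathrm{sign}(\sigma_i + \mu Y_i)) - \frac{1}{2n}\sum_{i=1}^{n} |\sigma_i + \mu Y_i| + \frac{1 + \mu}{2} - \frac{2r\mu}{\alpha^2}.
\end{aligned}$$

Substituting gives the result. □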
6.3. Local Rademacher complexities for kernel classes. One case in which
the functions $\psi$ and $\hat\psi_n$ can be computed explicitly is when $\mathcal{F}$ is a kernel
class, that is, the unit ball in the reproducing kernel Hilbert space associated
with a positive definite kernel $k$. Observe that in this case $\mathcal{F}$ is a convex and
symmetric set.

Let $k$ be a positive definite function on $\mathcal{X}$, that is, a symmetric function
such that for all $n \ge 1$,

$$\forall x_1, \ldots, x_n \in \mathcal{X},\ \forall \alpha_1, \ldots, \alpha_n \in \mathbb{R} \qquad \sum_{i,j=1}^{n} \alpha_i\alpha_j k(x_i, x_j) \ge 0.$$

Recall the main properties of reproducing kernel Hilbert spaces that we
require:

(a) The reproducing kernel Hilbert space associated with $k$ is the unique
Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ such that for all $f \in \mathcal{F}$ and all $x \in \mathcal{X}$,
$k(x, \cdot) \in \mathcal{H}$ and

$$f(x) = \langle f, k(x, \cdot)\rangle. \tag{6.1}$$

(b) $\mathcal{H}$ can be constructed as the completion of the linear span of the
functions $k(x, \cdot)$ for $x \in \mathcal{X}$, endowed with the inner product

$$\left\langle \sum_{i=1}^{n} \alpha_i k(x_i, \cdot),\ \sum_{j=1}^{m} \beta_j k(y_j, \cdot) \right\rangle = \sum_{i,j=1}^{n,m} \alpha_i\beta_j k(x_i, y_j).$$

We use $\|\cdot\|$ to denote the norm in $\mathcal{H}$.

One method for regression consists of solving the following least squares
problem in the unit ball of $\mathcal{H}$:

$$\min_{f \in \mathcal{H} : \|f\| \le 1} \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2.$$

Notice that considering a ball of some other radius is equivalent to rescaling
the class. We are thus interested in computing the localized Rademacher
averages of the class of functions

$$\mathcal{F} = \{f \in \mathcal{H} : \|f\| \le 1\}.$$

Assume that $\mathbb{E}k(X, X) < \infty$ and define $T : L_2(P) \to L_2(P)$ as the integral
operator associated with $k$ and $P$, that is, $Tf(\cdot) = \int k(\cdot, y)f(y)\,dP(y)$. It is
possible to show that $T$ is a positive semidefinite trace-class operator. Let
$(\lambda_i)_{i=1}^{\infty}$ be its eigenvalues, arranged in a nonincreasing order. Also, given an
i.i.d. sample $X_1, \ldots, X_n$ from $P$, consider the normalized Gram matrix (or
kernel matrix) $\hat T_n$ defined as $\hat T_n = \frac{1}{n}(k(X_i, X_j))_{i,j=1,\ldots,n}$. Let $(\hat\lambda_i)_{i=1}^{n}$ be its
eigenvalues, arranged in a nonincreasing order.

The following result was proved in [24].
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Theorem 6.5. For every r > 0, 



oo \ 1/2 



Moreover, there exists an absolute constant c such that if Xi > 1/n, then for 
every r> 1/n, 

oo \ 1/2 

-J2mm{r,\i}j . 

The following lemma is a data-dependent version. 
Lemma 6.6. For every r > 0, 

EMf^^-Pnf^<r}< -^min{r,\} . 



n . , 



The proof of this result can be found in Appendix A.2. The fact that we
have replaced $Pf^2$ by $P_n f^2$ and conditioned on the data yields a result that
involves only the eigenvalues of the empirical Gram matrix.
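Since the bound $\big( \frac{2}{n} \sum_{i=1}^{n} \min\{r, \hat\lambda_i\} \big)^{1/2}$ of Lemma 6.6 depends only on the spectrum of the Gram matrix, it can be evaluated directly from data. A minimal sketch, assuming the eigenvalues of the normalized Gram matrix have already been computed; the function name and the toy spectrum are hypothetical.

```python
import numpy as np

def lemma_6_6_bound(gram_eigenvalues, r):
    """Return ((2/n) * sum_i min(r, lambda_i)) ** 0.5 for the n eigenvalues of T_n."""
    lam = np.asarray(gram_eigenvalues, dtype=float)
    n = lam.size
    return float(np.sqrt(2.0 / n * np.sum(np.minimum(r, lam))))

# Toy spectrum with n = 4 (hypothetical values that sum to 1).
lam_hat = np.array([0.5, 0.25, 0.125, 0.125])
bound_small_r = lemma_6_6_bound(lam_hat, r=0.125)  # every term is capped at r
bound_large_r = lemma_6_6_bound(lam_hat, r=1.0)    # no term is capped
```

For small $r$ every term of the sum is capped at $r$ and the bound scales like $\sqrt{2r}$; for large $r$ it saturates at $\sqrt{(2/n) \sum_i \hat\lambda_i}$.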

We can now state a consequence of Theorem 5.4 for the proposed regression
algorithm on the unit ball of $\mathcal{H}$.

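Corollary 6.7. Assume that $\sup_{x \in \mathcal{X}} k(x, x) \le 1$. Let $\mathcal{F} = \{ f \in \mathcal{H} : \|f\| \le 1 \}$
and let $\ell$ be a loss function satisfying conditions 1-3. Let $\hat f$ be any element
of $\mathcal{F}$ satisfying $P_n \ell_{\hat f} = \inf_{f \in \mathcal{F}} P_n \ell_f$.

There exists a constant $c$ depending only on $L$ and $B$ such that with
probability at least $1 - 6e^{-x}$,

$$P(\ell_{\hat f} - \ell_{f^*}) \le c \Big( \hat r^* + \frac{x}{n} \Big),$$

where

$$\hat r^* \le \min_{0 \le h \le n} \Big( \frac{h}{n} + \sqrt{\frac{1}{n} \sum_{i > h} \hat\lambda_i} \Big).$$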

We observe that $\hat r^*$ is at most of order $1/\sqrt{n}$ (if we take $h = 0$), but can
be of order $\log n / n$ if the eigenvalues of $T_n$ decay exponentially quickly.

In addition, the eigenvalues of the Gram matrix are not hard to compute, 
so that the above result can suggest an implementable heuristic for choosing 
the kernel k from the data. The issue of the choice of the kernel is being 
intensively studied in the machine learning community. 
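One way to read this remark as a procedure is to evaluate the eigenvalue bound on $\hat r^*$ from Corollary 6.7, namely $\min_{0 \le h \le n} \big( h/n + \sqrt{(1/n) \sum_{i>h} \hat\lambda_i} \big)$, for each candidate kernel and compare the values. The sketch below only evaluates that expression for a given spectrum. Using it to rank kernels is a heuristic, and the function name and the two toy spectra are assumptions of this example.

```python
import numpy as np

def r_star_upper_bound(gram_eigenvalues):
    """min over 0 <= h <= n of h/n + sqrt((1/n) * sum_{i > h} lambda_i).

    Returns the minimal value and a minimizing h.
    """
    lam = np.sort(np.asarray(gram_eigenvalues, dtype=float))[::-1]
    n = lam.size
    # tail[h] = sum of the eigenvalues with (1-based) index > h, for h = 0..n.
    tail = np.concatenate([np.cumsum(lam[::-1])[::-1], [0.0]])
    h = np.arange(n + 1)
    values = h / n + np.sqrt(np.maximum(tail, 0.0) / n)
    best_h = int(np.argmin(values))
    return float(values[best_h]), best_h

# Flat spectrum: h = 0 is optimal and the bound is 1/sqrt(n).
flat_bound, flat_h = r_star_upper_bound(np.full(100, 0.01))

# Geometrically decaying spectrum lambda_i = 2^{-i}: a small h > 0 does better.
n = 256
exp_bound, exp_h = r_star_upper_bound(0.5 ** np.arange(1, n + 1))
```

In the second example the minimizing $h$ is much smaller than $n$, and the resulting value is well below the $h = 0$ value $1/\sqrt{n}$, in line with the observation about fast eigenvalue decay.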
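Proof. Because of the symmetry of the $\sigma_i$ and because $\mathcal{F}$ is convex
and symmetric,

$$\begin{aligned}
\mathbb{E}_\sigma R_n \{ f \in \mathcal{F} : P_n (f - \hat f)^2 \le c_3 r \}
&= \mathbb{E}_\sigma R_n \{ f - \hat f : f \in \mathcal{F},\ P_n (f - \hat f)^2 \le c_3 r \} \\
&\le \mathbb{E}_\sigma R_n \{ f - g : f, g \in \mathcal{F},\ P_n (f - g)^2 \le c_3 r \} \\
&= 2\, \mathbb{E}_\sigma R_n \{ f : f \in \mathcal{F},\ P_n f^2 \le c_3 r / 4 \}.
\end{aligned}$$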



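Combining with Lemma 6.6 gives

$$2 c_1 \mathbb{E}_\sigma R_n \{ f \in \mathcal{F} : P_n (f - \hat f)^2 \le c_3 r \} + \frac{(c_2 + 2) x}{n}
\le 4 c_1 \Big( \frac{2}{n} \sum_{i=1}^{n} \min\Big\{ \frac{c_3 r}{4}, \hat\lambda_i \Big\} \Big)^{1/2} + \frac{(c_2 + 2) x}{n}.$$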



Let $\hat\psi_n(r)$ denote the right-hand side. Notice that $\hat\psi_n$ is a sub-root function,
so the estimate of Theorem 5.4 can be applied. To compute the fixed point
of $B \hat\psi_n$, first notice that adding a constant $a$ to a sub-root function can
increase its fixed point by at most $2a$. Thus, it suffices to show that

$$r \le 4 c_1 \Big( \frac{2}{n} \sum_{i=1}^{n} \min\Big\{ \frac{c_3 r}{4}, \hat\lambda_i \Big\} \Big)^{1/2}$$

implies

$$r \le c \min_{0 \le h \le n} \Big( \frac{h}{n} + \sqrt{\frac{1}{n} \sum_{i > h} \hat\lambda_i} \Big) \tag{6.2}$$



for some universal constant c. Under this hypothesis, 

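$$\begin{aligned}
\Big( \frac{r}{4 c_1} \Big)^2
&\le \frac{2}{n} \sum_{i=1}^{n} \min\Big\{ \frac{c_3 r}{4}, \hat\lambda_i \Big\} \\
&= \frac{2}{n} \min_{S \subseteq \{1, \dots, n\}} \Big( \sum_{i \in S} \frac{c_3 r}{4} + \sum_{i \notin S} \hat\lambda_i \Big) \\
&= \frac{2}{n} \min_{0 \le h \le n} \Big( \frac{c_3 h r}{4} + \sum_{i > h} \hat\lambda_i \Big).
\end{aligned}$$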

Solving the quadratic inequality for each value of h gives (6.2). □ 
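To spell out this last step: for fixed $h$, the final bound in the chain has the form $r^2 \le A_h r + B_h$, and any $r \ge 0$ satisfying such an inequality obeys $r \le A_h + \sqrt{B_h}$. The symbols $A_h$ and $B_h$ are shorthand introduced here for illustration and are not notation from the text.

```latex
\begin{align*}
\Big(\frac{r}{4c_1}\Big)^2 \le \frac{2}{n}\Big(\frac{c_3 h r}{4} + \sum_{i>h}\hat\lambda_i\Big)
&\iff r^2 \le A_h r + B_h,
\qquad A_h = \frac{8 c_1^2 c_3 h}{n},\quad B_h = \frac{32 c_1^2}{n}\sum_{i>h}\hat\lambda_i, \\
&\implies r \le \frac{A_h + \sqrt{A_h^2 + 4B_h}}{2} \le A_h + \sqrt{B_h}
 = \frac{8 c_1^2 c_3 h}{n} + 4\sqrt{2}\, c_1 \sqrt{\frac{1}{n}\sum_{i>h}\hat\lambda_i}.
\end{align*}
```

Taking the minimum over $h$ then gives a bound of the form (6.2), with a constant that can be taken as $\max\{8 c_1^2 c_3,\ 4\sqrt{2}\, c_1\}$.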

APPENDIX 

A.1. Additional material. This section contains a collection of results
that are needed in the proofs. Most of them are classical or easy to derive
from classical results. We present proofs for the sake of completeness.

Recall the following improvement of Rio's [29] version of Talagrand's
concentration inequality, which is due to Bousquet [7, 8].

Theorem A.1. Let $c > 0$, let $X_i$ be independent random variables
distributed according to $P$ and let $\mathcal{F}$ be a set of functions from $\mathcal{X}$ to $\mathbb{R}$. Assume
that all functions $f$ in $\mathcal{F}$ satisfy $\mathbb{E} f = 0$ and $\|f\|_\infty \le c$.

Let $\sigma$ be a positive real number such that $\sigma^2 \ge \sup_{f \in \mathcal{F}} \mathrm{Var}[f(X_i)]$. Then,
for any $x > 0$,



In a similar way one can obtain a concentration result for the Rademacher 
averages of a class (using the result of [5]; see also [6]). In order to obtain 
the appropriate constants, notice that 



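$$\mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i f(X_i) = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i \Big( f(X_i) - \frac{a + b}{2} \Big)$$

and $|f - (a + b)/2| \le (b - a)/2$ for every $f$ with values in $[a, b]$.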

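Theorem A.2. Let $\mathcal{F}$ be a class of functions that map $\mathcal{X}$ into $[a, b]$. Let

$$Z = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i f(X_i) = n\, \mathbb{E}_\sigma R_n \mathcal{F}.$$

Then for all $x > 0$,

$$\Pr\Big( Z \ge \mathbb{E} Z + \sqrt{(b - a)\, x\, \mathbb{E} Z} + \frac{(b - a) x}{6} \Big) \le e^{-x}$$

and

$$\Pr\Big( Z \le \mathbb{E} Z - \sqrt{(b - a)\, x\, \mathbb{E} Z} \Big) \le e^{-x}.$$
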
Lemma A.3. For $u, v \ge 0$,

$$\sqrt{u + v} \le \sqrt{u} + \sqrt{v},$$

and for any $\alpha > 0$,

$$2 \sqrt{uv} \le \alpha u + \frac{v}{\alpha}.$$
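Both inequalities of Lemma A.3, $\sqrt{u+v} \le \sqrt{u} + \sqrt{v}$ and $2\sqrt{uv} \le \alpha u + v/\alpha$, follow from expanding a square. A short verification, included here as a sketch for completeness:

```latex
\begin{align*}
\bigl(\sqrt{u} + \sqrt{v}\bigr)^2 &= u + v + 2\sqrt{uv} \;\ge\; u + v, \\
\alpha u + \frac{v}{\alpha} - 2\sqrt{uv}
  &= \Bigl(\sqrt{\alpha u} - \sqrt{v/\alpha}\Bigr)^2 \;\ge\; 0 .
\end{align*}
```

Taking square roots in the first line gives the first inequality; the second line is the second inequality rearranged.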
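Lemma A.4. Fix $x > 0$, and let $\mathcal{F}$ be a class of functions with ranges
in $[a, b]$. Then, with probability at least $1 - e^{-x}$,

$$\mathbb{E} R_n \mathcal{F} \le \inf_{\alpha \in (0,1)} \Big( \frac{1}{1 - \alpha} \mathbb{E}_\sigma R_n \mathcal{F} + \frac{(b - a) x}{4 n \alpha (1 - \alpha)} \Big).$$

Also, with probability at least $1 - e^{-x}$,

$$\mathbb{E}_\sigma R_n \mathcal{F} \le \inf_{\alpha > 0} \Big( (1 + \alpha)\, \mathbb{E} R_n \mathcal{F} + \frac{(b - a) x}{2n} \Big( \frac{1}{2\alpha} + \frac{1}{3} \Big) \Big).$$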

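Proof. The second inequality of Theorem A.2 and Lemma A.3 imply
that with probability at least $1 - e^{-x}$,

$$\begin{aligned}
\mathbb{E} R_n \mathcal{F}
&\le \mathbb{E}_\sigma R_n \mathcal{F} + \sqrt{\frac{(b - a) x}{n}\, \mathbb{E} R_n \mathcal{F}} \\
&\le \mathbb{E}_\sigma R_n \mathcal{F} + \alpha\, \mathbb{E} R_n \mathcal{F} + \frac{(b - a) x}{4 n \alpha},
\end{aligned}$$

and the first claim of the lemma follows. The proof of the second claim is
similar, but uses the first inequality of Theorem A.2. □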

A standard fact is that the expected deviation of the empirical means 
from the actual ones can be controlled by the Rademacher averages of the 
class. 

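Lemma A.5. For any class of functions $\mathcal{F}$,

$$\max\Big( \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f),\ \mathbb{E} \sup_{f \in \mathcal{F}} (P_n f - Pf) \Big) \le 2\, \mathbb{E} R_n \mathcal{F}.$$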

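Proof. Let $X_1', \dots, X_n'$ be an independent copy of $X_1, \dots, X_n$, and set
$P_n'$ to be the empirical measure supported on $X_1', \dots, X_n'$. By the convexity
of the supremum and by symmetry,

$$\begin{aligned}
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f)
&= \mathbb{E} \sup_{f \in \mathcal{F}} (\mathbb{E} P_n' f - P_n f) \\
&\le \mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f) \\
&= \frac{1}{n} \mathbb{E} \sup_{f \in \mathcal{F}} \Big[ \sum_{i=1}^{n} \sigma_i f(X_i') - \sigma_i f(X_i) \Big] \\
&\le \frac{1}{n} \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i f(X_i') + \frac{1}{n} \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} -\sigma_i f(X_i) \\
&= 2\, \mathbb{E} \sup_{f \in \mathcal{F}} R_n f.
\end{aligned}$$

Using an identical argument, the same holds for $P_n f - Pf$. □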

In addition, recall the following contraction property, which is due to 
Ledoux and Talagrand [17]. 

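Theorem A.6. Let $\phi$ be a contraction, that is, $|\phi(x) - \phi(y)| \le |x - y|$.
Then, for every class $\mathcal{F}$,

$$\mathbb{E}_\sigma R_n\, \phi \circ \mathcal{F} \le \mathbb{E}_\sigma R_n \mathcal{F},$$

where $\phi \circ \mathcal{F} := \{ \phi \circ f : f \in \mathcal{F} \}$.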

The interested reader may find some additional useful properties of the 
Rademacher averages in [3, 27]. 
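The contraction property can be checked exactly on a small example, because for a finite class and a small sample the expectation over the Rademacher variables is a finite average over all $2^n$ sign vectors. The sketch below does this for a randomly generated table of function values and the contraction $\phi = \tanh$. The table, the choice of $\phi$ and the function name are assumptions of this example; the comparison checked is the fixed-sample form $\mathbb{E}_\sigma R_n (\phi \circ \mathcal{F}) \le \mathbb{E}_\sigma R_n \mathcal{F}$.

```python
import itertools
import numpy as np

def empirical_rademacher(values):
    """Exact E_sigma sup_f (1/n) sum_i sigma_i f(X_i) for a finite class.

    `values` has shape (num_functions, n); row j holds (f_j(X_1), ..., f_j(X_n)).
    All 2**n sign vectors are enumerated, so n must be small.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))  # (2**n, n)
    sups = np.max(signs @ values.T, axis=1) / n  # sup over the class, per sign vector
    return float(np.mean(sups))

rng = np.random.default_rng(2)
F_values = rng.uniform(-1.0, 1.0, size=(6, 8))  # 6 hypothetical functions, n = 8
phi = np.tanh                                    # 1-Lipschitz, hence a contraction
r_F = empirical_rademacher(F_values)
r_phi_F = empirical_rademacher(phi(F_values))
```

The enumeration costs $2^n$ evaluations, so this is only a sanity check for toy sizes; in practice the expectation would be approximated by sampling sign vectors.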



A.2. Proofs. 



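Proof of Theorem 2.1. Define $V^+ = \sup_{f \in \mathcal{F}} (Pf - P_n f)$. Since
$\sup_{f \in \mathcal{F}} \mathrm{Var}[f(X_i)] \le r$, and $\|f - Pf\|_\infty \le b - a$, Theorem A.1 implies that,
with probability at least $1 - e^{-x}$,

$$V^+ \le \mathbb{E} V^+ + \sqrt{\frac{2 x r}{n} + \frac{4 x (b - a)\, \mathbb{E} V^+}{n}} + \frac{(b - a) x}{3n}.$$

Thus by Lemma A.3, with probability at least $1 - e^{-x}$,

$$V^+ \le \inf_{\alpha > 0} \Big( (1 + \alpha)\, \mathbb{E} V^+ + \sqrt{\frac{2 r x}{n}} + (b - a) \Big( \frac{1}{3} + \frac{1}{\alpha} \Big) \frac{x}{n} \Big).$$

Applying Lemma A.5 gives the first assertion of Theorem 2.1. The second
part of the theorem follows by combining the first one and Lemma A.4,
and noticing that $\inf_\alpha f(\alpha) + \inf_\alpha g(\alpha) \le \inf_\alpha (f(\alpha) + g(\alpha))$. Finally, the fact
that the same results hold for $\sup_{f \in \mathcal{F}} (P_n f - Pf)$ can be easily obtained by
applying the above reasoning to the class $-\mathcal{F} = \{ -f : f \in \mathcal{F} \}$ and noticing
that the Rademacher averages of $-\mathcal{F}$ and $\mathcal{F}$ are identical. □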

Proof of Lemma 3.2. To prove the continuity of $\psi$, let $x > y > 0$, and
note that since $\psi$ is nondecreasing, $|\psi(x) - \psi(y)| = \psi(x) - \psi(y)$. From the
fact that $\psi(r)/\sqrt{r}$ is nonincreasing it follows that $\psi(x)/\sqrt{y} \le \sqrt{x}\, \psi(y)/y$,
and thus

$$\psi(x) - \psi(y) = \sqrt{y}\, \frac{\psi(x)}{\sqrt{y}} - \psi(y) \le \psi(y)\, \frac{\sqrt{x} - \sqrt{y}}{\sqrt{y}}.$$

Letting $x$ tend to $y$ from above, $|\psi(x) - \psi(y)|$ tends to 0, and $\psi$ is right-continuous at $y$.
A similar argument shows the left-sided continuity of $\psi$.

As for the second part of the claim, note that $\psi(x)/x$ is nonnegative and
continuous on $(0, \infty)$, and since $1/\sqrt{x}$ is strictly decreasing on $(0, \infty)$, then
$\psi(x)/x$ is also strictly decreasing.

Observe that if $\psi(x)/x$ is always larger than 1 on $(0, \infty)$, then
$\lim_{x \to \infty} \psi(x)/\sqrt{x} = \infty$, which is impossible. On the other hand, if $\psi(x)/x < 1$ on $(0, \infty)$,
then $\lim_{x \to 0} \psi(x)/\sqrt{x} = 0$, contrary to the assumption that $\psi$ is nontrivial.
Thus the equation $\psi(r)/r = 1$ has a positive solution and this solution is
unique by monotonicity.

Finally, if for some $r > 0$, $r \ge \psi(r)$, then $\psi(t)/t \le 1$ for all $t \ge r$ [since
$\psi(x)/x$ is nonincreasing] and thus $r^* \le r$. The other direction follows in a
similar manner. □
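The characterization at the end of this proof ($r \ge \psi(r)$ exactly when $r \ge r^*$) means the fixed point of a sub-root function can be located by bisection. The sketch below does this for the hypothetical sub-root function $\psi(r) = a\sqrt{r} + b$, whose fixed point has a closed form; the constants `a` and `b`, the function names and the tolerance are assumptions of this example.

```python
import math

def sub_root_fixed_point(psi, hi=1.0, tol=1e-12, max_iter=200):
    """Bisection for the unique positive solution of psi(r) = r.

    Relies on the characterization in Lemma 3.2: r >= psi(r) iff r >= r*.
    """
    # Grow the upper bracket until it dominates psi, so that r* <= hi.
    while psi(hi) > hi:
        hi *= 2.0
    lo = 0.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mid >= psi(mid):
            hi = mid  # mid >= r*
        else:
            lo = mid  # mid < r*
        if hi - lo <= tol:
            break
    return hi

a, b = 0.3, 0.01  # hypothetical constants

def psi(r):
    # Nonnegative, nondecreasing, and psi(r)/sqrt(r) = a + b/sqrt(r) is nonincreasing.
    return a * math.sqrt(r) + b

r_star = sub_root_fixed_point(psi)
# Solving s^2 = a*s + b for s = sqrt(r) gives the closed form below.
closed_form = ((a + math.sqrt(a * a + 4.0 * b)) / 2.0) ** 2
```

In this example $r^* \approx 0.109$, which lies between $a^2 = 0.09$ (the fixed point of $a\sqrt{r}$ alone) and $a^2 + 2b = 0.11$.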

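Proof of Lemma 3.4. Observe that, by symmetry of the Rademacher
random variables, one has $\psi(r) = \mathbb{E}_\sigma R_n \{ f - \hat f : f \in \mathcal{F},\ T(f - \hat f) \le r \}$ so that,
by translating the class, it suffices to consider the case where $\hat f = 0$.

Note that $\psi$ is nonnegative, since by Jensen's inequality

$$\mathbb{E}_\sigma \sup_{f \in \mathcal{F}} R_n f \ge \sup_{f \in \mathcal{F}} \mathbb{E}_\sigma R_n f = 0.$$

Moreover, $\psi$ is nondecreasing since $\{ f \in \mathcal{F} : T(f) \le r \} \subset \{ f \in \mathcal{F} : T(f) \le r' \}$
for $r \le r'$. It remains to show that for any $0 < r_1 \le r_2$, $\psi(r_1) \ge \sqrt{r_1/r_2} \cdot \psi(r_2)$.
To this end, fix any sample and any realization of the Rademacher random
variables, and set $f_0$ to be a function for which

$$\sup_{f \in \mathcal{F},\ T(f) \le r_2} \sum_{i=1}^{n} \sigma_i f(X_i)$$

is attained (if the supremum is not attained only a slight modification is
required). Since $T(f_0) \le r_2$, then $T(\sqrt{r_1/r_2} \cdot f_0) \le r_1$ by assumption.
Furthermore, since $\mathcal{F}$ is star-shaped, the function $\sqrt{r_1/r_2}\, f_0$ belongs to $\mathcal{F}$ and
satisfies that $T(\sqrt{r_1/r_2}\, f_0) \le r_1$. Hence

$$\sup_{f \in \mathcal{F},\ T(f) \le r_1} \sum_{i=1}^{n} \sigma_i f(X_i) \ge \sqrt{\frac{r_1}{r_2}} \sum_{i=1}^{n} \sigma_i f_0(X_i) = \sqrt{\frac{r_1}{r_2}} \sup_{f \in \mathcal{F},\ T(f) \le r_2} \sum_{i=1}^{n} \sigma_i f(X_i),$$

and the result follows by taking expectations with respect to the Rademacher
random variables. □

Proof of Corollary 3.7. The proof uses the following result of [11],
which relates the empirical Rademacher averages to the empirical $L_2$ entropy
of the class. The covering number $N(\epsilon, \mathcal{F}, L_2(P_n))$ is the cardinality of the
smallest subset $\hat{\mathcal{F}}$ of $L_2(P_n)$ for which every element of $\mathcal{F}$ is within $\epsilon$ of some
element of $\hat{\mathcal{F}}$.

Theorem B.7 ([11]). There exists an absolute constant $C$ such that for
every class $\mathcal{F}$ and every $X_1, \dots, X_n \in \mathcal{X}$,

$$\mathbb{E}_\sigma R_n \mathcal{F} \le \frac{C}{\sqrt{n}} \int_0^{\infty} \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}\, d\epsilon.$$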

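Define the sub-root function

$$\psi(r) = 10\, \mathbb{E} R_n \{ f \in \mathrm{star}(\mathcal{F}, 0) : P f^2 \le r \} + \frac{11 \log n}{n}.$$

If $r \ge \psi(r)$, then Corollary 2.2 implies that, with probability at least $1 - 1/n$,

$$\{ f \in \mathrm{star}(\mathcal{F}, 0) : P f^2 \le r \} \subseteq \{ f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r \},$$

and thus

$$\mathbb{E} R_n \{ f \in \mathrm{star}(\mathcal{F}, 0) : P f^2 \le r \} \le \mathbb{E} R_n \{ f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2r \} + \frac{1}{n}.$$

It follows that $r^* = \psi(r^*)$ satisfies

$$r^* \le 10\, \mathbb{E} R_n \{ f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2 r^* \} + \frac{1 + 11 \log n}{n}. \tag{A.1}$$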

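But Theorem B.7 shows that

$$\mathbb{E} R_n \{ f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2 r^* \} \le \frac{C}{\sqrt{n}}\, \mathbb{E} \int_0^{\sqrt{2 r^*}} \sqrt{\log N(\epsilon, \mathrm{star}(\mathcal{F}, 0), L_2(P_n))}\, d\epsilon.$$

It is easy to see that we can construct an $\epsilon$-cover for $\mathrm{star}(\mathcal{F}, 0)$ using an
$\epsilon/2$-cover for $\mathcal{F}$ and an $\epsilon/2$-cover for the interval $[0, 1]$, which implies

$$\log N(\epsilon, \mathrm{star}(\mathcal{F}, 0), L_2(P_n)) \le \log \Big\{ N\Big( \frac{\epsilon}{2}, \mathcal{F}, L_2(P_n) \Big) \Big( \Big\lceil \frac{2}{\epsilon} \Big\rceil + 1 \Big) \Big\}.$$

Now, recall that [14] for any probability distribution $P$ and any class $\mathcal{F}$ with
VC-dimension $d < \infty$,

$$\log N\Big( \frac{\epsilon}{2}, \mathcal{F}, L_2(P) \Big) \le c\, d \log\Big( \frac{1}{\epsilon} \Big).$$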


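Therefore

$$\begin{aligned}
\mathbb{E} R_n \{ f \in \mathrm{star}(\mathcal{F}, 0) : P_n f^2 \le 2 r^* \}
&\le \sqrt{\frac{c\, d}{n}} \int_0^{\sqrt{2 r^*}} \sqrt{\log\Big( \frac{1}{\epsilon} \Big)}\, d\epsilon \\
&\le \sqrt{\frac{c\, d\, r^* \log(1/r^*)}{n}} \\
&\le \sqrt{c \Big( \frac{d^2}{n^2} + \frac{d\, r^* \log(n/(ed))}{n} \Big)},
\end{aligned}$$

where $c$ represents an absolute constant whose value may change from line
to line. Substituting into (A.1) and solving for $r^*$ shows that

$$r^* \le \frac{c\, d \log(n/d)}{n},$$

provided $n \ge d$. The result follows from Theorem 3.3. □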

Proof of Theorem 5.2. Let $f^* = \arg\min_{f \in \mathcal{F}} P \ell_f$. (For simplicity,
assume that the minimum exists; if it does not, the proof is easily extended
by considering the limit of a sequence of functions with expected loss
approaching the infimum.) Then, by definition of $\hat f$, $P_n \ell_{\hat f} \le P_n \ell_{f^*}$. Since the
variance of $\ell_{f^*}(X_i, Y_i)$ is no more than some constant times $L^*$, we can
apply Bernstein's inequality (see, e.g., [10], Theorem 8.2) to show that with
probability at least $1 - e^{-x}$,

$$P_n \ell_{\hat f} \le P_n \ell_{f^*} \le L^* + c \Big( \sqrt{\frac{L^* x}{n}} + \frac{x}{n} \Big).$$

Thus, by Theorem 3.3, with probability at least $1 - 2e^{-x}$,

$$P \ell_{\hat f} \le \frac{K}{K - 1} \Big( L^* + c \Big( \sqrt{\frac{L^* x}{n}} + \frac{x}{n} \Big) \Big) + c K \Big( r^* + \frac{x}{n} \Big).$$

Setting

$$K - 1 = \sqrt{\frac{\max(L^*, x/n)}{r^*}},$$

noting that $r^* \ge x/n$ and simplifying gives the first inequality. A similar
argument using Theorem 4.1 implies the second inequality. □



Proof of Lemma 6.6. Introduce the operator $C_n$ on $\mathcal{H}$ defined by

$$(C_n f)(x) = \frac{1}{n} \sum_{i=1}^{n} f(X_i)\, k(X_i, x),$$

so that, using (6.1),

$$\langle g, C_n f \rangle = \frac{1}{n} \sum_{i=1}^{n} f(X_i)\, g(X_i),$$

and $\langle f, C_n f \rangle = P_n f^2$, implying that $C_n$ is positive semidefinite.

Suppose that $f$ is an eigenfunction of $C_n$ with eigenvalue $\lambda$. Then for all $i$,

$$\lambda f(X_i) = (C_n f)(X_i) = \frac{1}{n} \sum_{j=1}^{n} f(X_j)\, k(X_j, X_i).$$

Thus, the vector $(f(X_1), \dots, f(X_n))$ is either zero (which implies $C_n f = 0$
and hence $\lambda = 0$) or is an eigenvector of $T_n$ with eigenvalue $\lambda$. Conversely,
if $T_n v = \lambda v$ for some vector $v$, then

$$C_n \Big( \sum_{i=1}^{n} v_i k(X_i, \cdot) \Big) = \frac{1}{n} \sum_{i,j=1}^{n} v_i\, k(X_i, X_j)\, k(X_j, \cdot) = \lambda \sum_{j=1}^{n} v_j k(X_j, \cdot).$$

Thus, the eigenvalues of $T_n$ are the same as the $n$ largest eigenvalues of
$C_n$, and the remaining eigenvalues of $C_n$ are zero. Let $(\hat\lambda_i)$ denote these
eigenvalues, arranged in nonincreasing order.
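For the linear kernel $k(x, y) = \langle x, y \rangle$ on $\mathbb{R}^d$ this correspondence between the two spectra can be checked numerically: identifying $f_w(x) = \langle w, x \rangle$ with $w$, the operator $C_n$ acts as the $d \times d$ matrix $\frac{1}{n} X^\top X$, while $T_n = \frac{1}{n} X X^\top$. A minimal sketch on synthetic data; the dimensions and the random sample are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 10
X = rng.normal(size=(n, d))  # synthetic sample; rows are X_1, ..., X_n

T_n = X @ X.T / n  # normalized Gram matrix of k(x, y) = <x, y>  (n x n)
C_n = X.T @ X / n  # matrix of C_n: w -> (1/n) sum_i <w, X_i> X_i  (d x d)

eig_T = np.sort(np.linalg.eigvalsh(T_n))[::-1]  # n eigenvalues, nonincreasing
eig_C = np.sort(np.linalg.eigvalsh(C_n))[::-1]  # d eigenvalues, nonincreasing
```

With $d > n$, the $n$ largest eigenvalues of the matrix of $C_n$ coincide with those of $T_n$ and the remaining $d - n$ are zero up to round-off.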

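Let $(\Phi_i)_{i \ge 1}$ be an orthonormal basis of $\mathcal{H}$ of eigenfunctions of $C_n$ (such
that $\Phi_i$ is associated with $\hat\lambda_i$). Fix $0 \le h \le n$ and note that for any $f \in \mathcal{H}$,

$$\begin{aligned}
\sum_{i=1}^{n} \sigma_i f(X_i)
&= \Big\langle f, \sum_{i=1}^{n} \sigma_i k(X_i, \cdot) \Big\rangle \\
&= \Big\langle \sum_{j=1}^{h} \sqrt{\hat\lambda_j}\, \langle f, \Phi_j \rangle \Phi_j,\ \sum_{j=1}^{h} \frac{1}{\sqrt{\hat\lambda_j}} \Big\langle \sum_{i=1}^{n} \sigma_i k(X_i, \cdot), \Phi_j \Big\rangle \Phi_j \Big\rangle
 + \Big\langle f, \sum_{j > h} \Big\langle \sum_{i=1}^{n} \sigma_i k(X_i, \cdot), \Phi_j \Big\rangle \Phi_j \Big\rangle.
\end{aligned}$$

If $\|f\| \le 1$ and

$$r \ge P_n f^2 = \langle f, C_n f \rangle = \sum_{i \ge 1} \hat\lambda_i \langle f, \Phi_i \rangle^2,$$

then by the Cauchy-Schwarz inequality

$$\sum_{i=1}^{n} \sigma_i f(X_i) \le \sqrt{r \sum_{j=1}^{h} \frac{1}{\hat\lambda_j} \Big\langle \sum_{i=1}^{n} \sigma_i k(X_i, \cdot), \Phi_j \Big\rangle^2} + \sqrt{\sum_{j > h} \Big\langle \sum_{i=1}^{n} \sigma_i k(X_i, \cdot), \Phi_j \Big\rangle^2}. \tag{A.2}$$

Moreover,

$$\begin{aligned}
\frac{1}{n} \mathbb{E}_\sigma \Big\langle \sum_{i=1}^{n} \sigma_i k(X_i, \cdot), \Phi_j \Big\rangle^2
&= \frac{1}{n} \mathbb{E}_\sigma \sum_{i,l=1}^{n} \sigma_i \sigma_l \langle k(X_i, \cdot), \Phi_j \rangle \langle k(X_l, \cdot), \Phi_j \rangle \\
&= \frac{1}{n} \sum_{i=1}^{n} \langle k(X_i, \cdot), \Phi_j \rangle^2 = \langle \Phi_j, C_n \Phi_j \rangle = \hat\lambda_j.
\end{aligned}$$

Using (A.2) and Jensen's inequality, it follows that

$$\mathbb{E}_\sigma R_n \{ f \in \mathcal{F} : P_n f^2 \le r \} \le \frac{1}{\sqrt{n}} \min_{0 \le h \le n} \Big( \sqrt{h r} + \sqrt{\sum_{j=h+1}^{n} \hat\lambda_j} \Big),$$

which implies the result. □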
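The final implication uses two elementary facts: $\sqrt{a} + \sqrt{b} \le \sqrt{2(a + b)}$, and, for eigenvalues sorted in nonincreasing order, $\min_{0 \le h \le n} \big( h r + \sum_{j > h} \hat\lambda_j \big) = \sum_{j=1}^{n} \min\{r, \hat\lambda_j\}$. The sketch below checks both numerically on a synthetic spectrum; the spectrum, the value of `r` and the helper name are assumptions of this example.

```python
import numpy as np

def tail_sums(lam):
    """tail[h] = sum of lam[h:], for h = 0..n, with lam sorted nonincreasingly."""
    return np.concatenate([np.cumsum(lam[::-1])[::-1], [0.0]])

rng = np.random.default_rng(1)
n, r = 20, 0.02
lam = np.sort(rng.uniform(0.0, 1.0, size=n))[::-1] / n  # synthetic spectrum
h = np.arange(n + 1)
tail = tail_sums(lam)

# Identity: sum_j min(r, lam_j) = min_h (h * r + sum_{j > h} lam_j).
sum_of_mins = float(np.sum(np.minimum(r, lam)))
min_over_h = float(np.min(h * r + tail))

# Intermediate bound (1/sqrt(n)) * min_h (sqrt(h r) + sqrt(tail_h)) versus the
# bound ((2/n) * sum_j min(r, lam_j)) ** 0.5 stated in Lemma 6.6.
intermediate = float(np.min(np.sqrt(h * r) + np.sqrt(tail)) / np.sqrt(n))
stated = float(np.sqrt(2.0 / n * sum_of_mins))
```

The intermediate bound is never larger than the stated one, so the statement of Lemma 6.6 is a slightly weaker but more compact form of the last display.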